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1 

Bayesian Decision Theory 



During the next few lectures we will be looking at the inference from training 
data problem as a random process modeled by the joint probability distribu- 
tion over input (measurements) and output (say class labels) variables. In 
general, estimating the underlying distribution is a daunting and unwieldy 
task, but there are a number of constraints or "tricks of the trade" so to 
speak that under certain conditions make this task manageable and fairly 
effective. 

To make things simple, we will assume a discrete world, i.e., that the 
values of our random variables take on a finite number of values. Consider 
for example two random variables X taking on k possible values xi,...,Xk 
and H taking on two values hi, h2- The values of X could stand for a Body 
Mass Index (BMI) measurement weight/ height^ of a person and H stands 
for the two possibilities hi standing for the "person being over- weight" and 
/i2 as the possibility "person of normal weight". Given a BMI measurement 
we would like to estimate the probability of the person being over-weight. 

The joint probability P{X, H) is a two dimensional array (2-way array) 
with 2k entries (cells). Each training example (xi, hj) falls into one of those 
cells, therefore P{X = Xi,H = hj) = P{xi, hj) holds the ratio between the 
number of hits into cell (z, j) and the total number of training examples 
(assuming the training data arrive i.i.d.). As a result ^j^- P{xj, hj) = 1. 

The projections of the array onto its vertical and horizontal axes by sum- 
ming over columns or over rows is called marginalization and produces 
P{hj) = J2i -f (a^i) hj) the sum over the j'th row is the probability P{H = hj), 
i.e., the probability of a person being over- weight (or not) before we see any 
measurement — these are called priors. Likewise, P{xi) = ^j P{xi,hj) 
is the probability P(X = Xi) which is the probability of receiving such 
a BMI measurement to begin with — this is often called evidence. Note 
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Fig. 1.1. Joint probability P{X,H) where X ranges over 5 discrete values and H 
over two values. Each entry contains the number of hits for the cell {xi,hj). The 
joint probability P{xi, hj) is the number of hits divided by the total number of hits 
(22). See text for more details. 



that, by definition, Ylj-^i^j) — = 1- -^^S- ^-^ have that 

P{h\) = 14/22, -P(/i2) = 8/22 that is there is a higher prior probability of a 
person being over-weight than being of normal weight. Also ^(2:3) = 7/22 
is the highest meaning that we encounter BMI = X3 with the highest prob- 
ability. 

The conditional probability P{hj \ Xi) = P{xi, hj)/ P{xi) is the ratio be- 
tween the number of hits in cell (i, j) and the number of hits in the i'th 
column, i.e., the probability that the outcome is H = hj given the measure- 
ment X = Xi. In Fig. |1.1| we have P(/i2 | 2:3) = 3/7. Note that 

Y^p{h, I X.) = = = n-^)inx^) = i. 

i j j 

Likewise, the conditional probability P{xi \ hj) = P(xi,hj)/P{hj) is the 
number of hits in cell (i, j) normalized by the number of hits in the j'th row 
and represents the probability of receiving BMI = Xi given the class label 
H = hj (over- weight or not) of the person. In Fig. |l.l| we have P{x3 \ /12) = 
3/8 which is the probability of receiving BMI = X3 given that the person is 
known to be of normal weight. Note that Y^- P{xi \ hj) = 1. 
The Bayes formula arises from: 

P[Xi I hj)P[hj) = P{Xi,hj) = P{hj I Xi)P[Xi), 

from which we get: 

The left hand side P{hj \ Xi) is called the posterior probability and P{xi \ hj) 
is caUed the class conditional likelihood. The Bayes formula provides a 
way to estimate the posterior probability from the prior, evidence and class 
likelihood. It is useful in cases where it is natural to compute (or collect 
data of) the class likelihood, yet it is not quite simple to compute directly 
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the posterior. For example, given a measurement "12" we would like to 
estimate the probability that the measurement came from tossing a pair 
of dice or from spinning a roulette table. If x = 12 is our measurement, 
and hi stands for "pair of dice" and /12 for "roulette" then it is natural 
to compute the class conditional: P("12" | "pair of dice") = 1/36 and 
P("12" I "roulette") = 1/38. Computing the posterior directly is much 
more difficult. As another example, consider medical diagnosis. Once it is 
known that a patient suffers from some disease hj, it is natural to evaluate 
the probabilities P{xi \ hj) of the emerging symptoms Xj. As a result, in 
many inference problems it is natural to use the class conditionals as the 
basic building blocks and use the Bayes formula to invert those to obtain 
the posteriors. 

The Bayes rule can often lead to unintuitive results — the one in particu- 
lar is known as " base rate fallacy" which shows how an nonuniform prior can 
influence the mapping from likelihoods to posteriors. On an intuitive basis, 
people tend to ignore priors and equate likelihoods to posteriors. The follow- 
ing example is typical: consider the "Cancer test kit" problenjf] which has the 
following features: given that the subject has Cancer "C", the probability 
of the test kit producing a positive decision "-I-" is P{+ \ C) = 0.98 (which 
means that P{— \ C) = 0.02) and the probability of the kit producing a neg- 
ative decision "-" given that the subject is healthy "H" is P{— \ H) = 0.97 
(which means also that P{+ \ H) = 0.03). The prior probability of Cancer 
in the population is P{C) = 0.01. These numbers appear at first glance 
as quite reasonable, i.e, there is a probability of 98% that the test kit will 
produce the correct indication given that the subject has Cancer. What 
we are actually interested in is the probability that the subject has Cancer 
given that the test kit generated a positive decision, i.e., P{C \ +). Using 
Bayes rule: 

pic I +) = ' ^)-^(^) = ^(+ I c)nc) ^ 

^ ' ^ P(+) P(+ I C)P{C) + P[+ I H)P{H) 

which means that there is a 26.6% chance that the subject has Cancer given 
that the test kit produced a positive response — by all means a very poor 
performance. 

If we draw the posteriors P{hi \x) and P(/i2 I x) using the probability 
distribution array in Fig. |1.1| we will see that P{hi \x) > P{h2 \ x) for all 
values of X smaller than a value which is in between X3 and X4. Therefore 
the decision which will minimize the probability of misclassification would 



t This example is adopted from Yishai Mansour's class notes on Machine Learning. 
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be to choose the class with the maximal posterior: 

h* = argmaxP(/ij | x), 

3 

which is known as the Maximal A Posteriori (MAP) decision principle. Since 
P{x) is simply a normalization factor, the MAP principle is equivalent to: 

h* = argmaxP(x | hj)P{hj). 
j 

In the case where information about the prior P{h) is not known or it is 
known that the prior is uniform, the we obtain the Maximum Likelihood 
(ML) principle: 

h* = argmaxP(a; | hj). 

j 

The MAP principle is a particular case of a more general principle, known 
as "proper Bayes", where a loss is incorporated into the decision process. 
Let l{hi, hj) be the loss incurred by deciding on class hi when in fact hj is 
the correct class. For example, the "0/1" loss function is: 

The least-squares loss function is: Z(/ij, hj) = \\hi — hj |p typically used when 
the outcomes are vectors in some high dimensional space rather than class 
labels. We define the expected risk: 

R{hi I x) = '^^l{hi, hj)P{hj I x). 
j 

The proper Bayes decision policy is to minimize the expected risk: 

h* = argmin R{hj \ x). 
j 

The MAP policy arises in the case l{hi,hj) is the 0/1 loss function: 
R{hi \x) = Y^ P{hj \x) = l- P{hi I x), 

Thus, 

argmin i?(/ij | x) = argmaxP(/ij | x). 
j 3 
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1.1 Independence Constraints 

At this point we may pause and ask what have we obtained? weh, not 
much. Clearly, the inference problem is captured by the joint probability 
distribution and we do not need all these formulas to see this. How do 
we obtain the necessary data to fill in the probability distribution array to 
begin with? Clearly without additional simplifying constraints the task is 
not practical as the size of these kind of arrays are exponential in the number 
of variables. There are three families of simplifying constraints used in the 
literature: 

• statistical independence constraints, 

• parametric form of the class likelihood P{xi \ hj) where the inference 
becomes a density estimation problem, 

• structural assumptions — latent (hidden) variables, graphical models. 

Today we will focus on the first of these simplifying constraints — statistical 
independence properties. 

Consider two random variables X and Y. The variables are statistically 
independent X1.Y if P(X \ Y) = P{X) meaning that information about 
the value of Y does not add anything about X. The independence condition 
is equivalent to the constraint: P{X,Y) = P(X)P(Y). This can be easily 
proven: if X±Y then P{X,Y) = P{X \ Y)P{Y) = P{X)P{Y). On the 
other hand, if P{X,Y) = P{X)P{Y) then 

P(X I Y\ - - ^(^)-P(^) - P(x\ 

nX I Y) - - ^(^^ - P{X). 

Let the values of X range over and the values of Y range over 

yi,...,yi. The associated k x / 2-way array, P{X = Xi,Y = yj) is repre- 
sented by the outer product P{xi, yj) = P{xi)P{yj) of two vectors P{X) = 
{P{xi),...,P{xk)) and P{Y) = {P{yi), P{yi)). In other words, the 2-way 
array viewed as a matrix is of rank 1 and is determined by A; -|- / (minus 2 
because the sum of each vector is 1) parameters rather than kl (minus 1) 
parameters. 

Likewise, if XiJ-X2-\ LX„ are n statistically independent random vari- 
ables where Xi ranges over ki discrete and distinct values, then the n-way 
array P{Xi, = P{Xi) ■ ... • P{Xn) is an outer-product of n vectors 
and is therefore determined by ki + ... + kn (minus n) parameters instead 
of kik2---kn (minus 1) parameter^ Viewed as a tensor, the joint probabil- 

t I am a bit over simplifying things because we are ignoring here the fact that the entries of 
the array should be non-negative. This means that there are additional non-linear constraints 
which effectively reduce the number of parameters — but nevertheless it stays exponential. 
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ity is a rank 1 tensor. The main point is that the statistical independence 
assumption reduced the representation of the multivariate joint distribution 
from exponential to linear size. 

Since our variables are typically divided to measurement variables and 
an output/class variable H (or in general Hi,..., Hi), it is useful to intro- 
duce another, weaker form, of independence known as conditional indepen- 
dence. Variables X,Y are conditionally independent given H, denoted by 
X±Y I H, iff P{X \Y.H) = P{X 1 H) meaning that given H, the value of Y 
does not add any information about X. This is equivalent to the condition 
P{X,Y I H) = P{X I H)P{Y I H). The proof goes as follows: 

• If P{X \Y,H) = P{X I H), then 



P(YY\M^ P{X,Y,H) _ P{X\Y,H)P{Y,H) 
P{X,Y\H) - p(^) - 

P{X \Y,H)P{Y \ H)P{H) 
P(H) 

If P{X,Y I H) = P{X I H)P{Y I H), then 



= P{X I H)P{Y I H) 



I ^' - P{Y,H) - P{Y I H) - I 

Consider as an example, Joe and Mo live on opposite sides of the city. 
Joe goes to work by train and Mo by car. Let X be the event "Joe is late 
to work" and Y be the event "Mo is late for work". Clearly X and Y are 
not independent because there could be other factors. For example, a train 
strike will cause Joe to be late, but because of the strike there would be 
extra traffic (people using their car instead of the train) thus causing Mo to 
be pate as well. Therefore, a third variable H standing for the event "train 
strike" would decouple X and Y. 

From a computational standpoint, the conditional independence assump- 
tion has a similar effect to the unconditional independence. Let X range 
over k distinct value, Y range over r distinct values and H range over s 
distinct values. Then P{X, Y, H) is a 3-way array of size k x r x s. Given 
that X1.Y I H means that P(X,Y j H = hi), a 2-way "slice" of the 3-way 
array along the H axis is represented by the outer-product of two vectors 
P{X I H = hi)P{Y \ H = hi). As a result the 3-way array is represented by 

s{k-\-r — 2) parameters instead of skr — 1. Likewise, if X\A LX„ | H then 

the n-way array P{Xi, X„ | H = hi) (which is a slice along the H axis of 
the (n + l)-array P[Xi, Xn, H)) is represented by an outer-product of n 
vectors, i.e., by /ci + . . + /c„ — n parameters. 



1.1 Independence Constraints 
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1.1.1 Example: Coin Toss 

We will use the ML principle to estimate the bias of a coin. Let X be a 
random variable taking the value {0, 1} and H would be our hypothesis 
taking a real value in [0, 1] standing for the coin's bias. If the coin's bias is 
q then P{X = \ H = q) = q and P{X = 1 \ H = q) = 1 — q. We receive m 
i.i.d. examples xi, ...,Xjji where Xi G {0, 1}. We wish to determine the value 
of q. Given that xiJ Lxm \ H, the ML problem we must solve is: 

m 

q* = argmaxP(xi, | = = JJ P{xi \ q) = argmax ^ log P{xi \ q). 

1=1 I 

Let < A < m stand for the number of '0' instances, i.e., A = \{xi = | i = 
1, m}\. Therefore our ML problem becomes: 

q* = argmax {Alogg + (n — A) log(l — g)} 
Taking the partial derivative with respect to q and setting it to zero: 

|[Alogg + (n - A) log(l - «)] = ^ - ^ = 0> 
produces the result: 



1.1.2 Example: Gaussian Density Estimation 

So far we considered constraints induced by conditional independent state- 
ments among the random variables as a means to reduce the space and time 
complexity of the multivariate distribution array. Another approach would 
be to assume some form of parametric form governing the entries of the array 
— the most popular assumption is Gaussian distribution P(Xi,...,X„) ~ 
A^(/i, E) with mean vector ^ and covariance matrix E. The parameters of 
the density function are denoted by ^ = (//, E) and for every vector x G i?" 
we have: 

Assume we are given an i.i.d sample of k points S = {xi, ...,Xfe}, Xj G i?", 
and we would like to find the Bayes optimal 9: 

e* = argmax P(5 | 0), 
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by maximizing the likelihood (here we are assuming that the the priors P{0) 
are equal, thus the maximum likelihood and the MAP would produce the 
same result). Because the sample was drawn i.i.d. we can assume that: 

k 

p{s I e) = l[p{^i 1 9). 
1=1 

Let L{9) = log P{S I 9) = ^^logP(xi | 9) and since Log is monotonously- 
increasing we have that 9* = argmaxL(6). The parameter estimation would 

e 

be recovered by taking derivatives with respect to 9, i.e., VgL = 0. We have: 

k 

m = -1 log 1^1 - ^ ^ log(27r) - E ^(^^ - - m). (1.1) 

i=l i 

We will start with a simple scenario where E = cr^/, i.e., all the covariances 
are zero and all the variances are equal to cr^. Thus, E^^ = a^^I and 
\E\ = cr^". After substitution (and removal of items which do not depend 
on 9) we have: 

L(^) = -nHoga-^5^"^^-^"' 

i 

The partial derivative with respect to /j,: 

dL o , 
- = a-^(;.-xO = 

i 

from which we obtain: 



C72 



i=l 

The partial derivative with respect to a is: 



dL nk q 11 1,9 



from which we obtain: 

k 



i=l 



Note that the reason for dividing by n is due to the fact that af 
a^ = a^, so that: 
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In the general case, E is a full rank symmetric matrix, then the derivative 
of eqn. (1.1) with respect to /j, is: 

|^ = i?-i^(/.-x.) = 0, 

i 

and since E^'^ is full rank we obtain fi = (l/A;)^jXj. For the derivative 
with respect to E we note two auxiliary items: 

If = \E\E-\ ^trace{AE~') = -{E-'AE-^)\ 

Using the fact that x^y = trace(xy^) we can transform E^^z to trace{zz^ E^^) 
for any vector z. Given that E~^ is symmetric, then: 

d 

—trace{zz~^ E-^) = -E^^zz^E'^. 
oE 

Substituting z = x — we obtain: 



dL_ 
dE 



-kE-^ + E-^ |^^(x, - ^)(xi - fi^^ E-' 



from which we obtain: 

k 



k 

i=l 



1.2 Incremental Bayes Classifier 

Consider another application of conditional dependence which is the Bayes 
incremental rule. Suppose we have processed n examples X^""^ = {Xi, Xn} 
and computed somehow P{H \ X^^^). We are given a new measurement X 
and wish to compute (update) the posterior P{H \ X^'^\X). We will use 
the chain rul^ 

P(X\Y7) = r^Wl^^ = P{Z\X,Y)PiX\Y)P{Y) ^ P{Z \ X,Y)P{X \ Y) 
^ ' ' ^ P{Y,Z ' P{Z\Y)P{Y) P{Z\Y) 

to obtain: 

^ ' ' ' P{X I XW) 

from conditional independence, P{X \ X^'^\H) = P{X \ H). The term 
P{X I X^'"'^) can expanded as follows: 

t this is based on the rule P(Xi,...,X„) = P{Xi | Xa, X„)P{X2 | X3,...,X„) • ■ • 
P(X„_i I X„)P(X„) 
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p{x\x ) - 2^ — — 



= ^ P{X I H = hi)P{H = hi I 

i 

After substitution we obtain: 

Ej ^(^ I ^ = hj)P{H = hj I XW) 

The old posterior | is now the prior for the updated formula. 

Consider the following exampl^ We have a coin which could be either fair 
or biased towards Head at a probability of 0.6. Let H = hi he the event 
that the coin is fair, and H = h2 that the coin is biased. We start with prior 
probabilities P{hi) = 0.75 and P{h2) = 0.25 (we have a higher initial belief 
that the coin is fair). Suppose our first coin toss is a Head, i.e., Xi = "0". 
Then, 

^ ' ^ P(xi) 0.5*0.75 + 0.6*0.25 

and P(/i2 I a^i) = 0.286. Our posterior belief that the coin is fair has gone 
down after a Head toss. Assume we have another measurement X2 = "0", 
then: 

p.x, I N I hi)P{hi I xi) 0-5*0.714 

P(hi xi,X2) = ; = = 0.675, 

^ ' ' ' normalization 0.5*0.714 + 0.6*0.286 

and -P(/i2 I xi,X2) = 0.325, thus our belief that the coin is fair continues to 
go down after Head tosses. 



1.3 Bayes Classifier for 2-class Normal Distributions 

For the last topic in this lecture consider the 2-class inference problem. We 
will encountered this problem in this course in the context of SVM and 
LDA. In the Bayes framework, if if = {hi, /12} denotes the "class member" 
variable with two possible outcomes, then the MAP decision policy calls for 

f adopted from Ron Rivest's 1994 class notes. 
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making the decision based on data x: 

h* = argmax {P{hi \ x),P(/i2 | x)} , 

or in other words the class hi would be chosen if P{hi \ x) > -P(/i2 | x). 
The decision surface (as a function of x) is therefore described by: 

P{hi I x) - P(/i2 I x) = 0. 

The questions we ask here is what would the Bayes optimal decision sur- 
face be like if we assume that the two classes are normally distributed with 
different means and the same covariance matrix? What we will see is that 
under the condition of equal priors P{hi) = P{h2) the decision surface is 
a hyperplane — and not only that, it is the same hyperplane produced by 
LDA. 



Claim 1 If P{hi) = P(/i2) and P(x | hi) ~ N{fii,E) and P(x | hi 
N{fj,2,E), the the Bayes optimal decision surface is a hyperplane w^(x — 
fi) = where fi = {fii + /i2)/2 and w = E~^{fii — ^2)- In other words, the 
decision surface is described by: 

^'^E-\fii - ^2) - ^(m + f^2)E-\fii - fi2) = 0. (1.2) 

Proof: The decision surface is described by P{hi \ x) — P(/i2 I x) = 
which is equivalent to the statement that the ratio of the posteriors is 1, or 
equivalently that the log of the ratio is zero, and using Bayes formula we 
obtain: 

Pi^\hi)Pihi) P(x|/ii) 

(J = log — — — — — - — r- = log 



P(X I /l2)P(/i2) ^P(x|/l2)' 

In other words, the decision surface is described by 

logP(x I /li)-logP(x I /12) = -^(x-/ii)T^-l(x-/ii) + ^(x-/i2)^^"'(x-/i2) 



After expanding the two terms we obtain eqn. (1.2). [] 
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Maximum Likelihood/ Maximum Entropy Duality 



In the previous lecture we defined the principle of Maximum Likelihood 
(ML): suppose we have random variables Xi, form a random sample 

from a discrete distribution whose joint probability distribution is P(x | cf)) 
where x = (xi, is a vector in the sample and (p is a parameter from 

some parameter space (which could be a discrete set of values say class 
membership). When P(x | cj)) is considered as a function of (j) it is called the 
likelihood function. The ML principle is to select the value of (f) that maxi- 
mizes the likelihood function over the observations (training set) xi, ...,'x^. 
If the observations are sampled i.i.d. (a common, not always valid, assump- 
tion), then the ML principle is to maximize: 

m mm 

(f)* = argmax JJP(xj | cp) = argmaxlog JJ P(xj | (p) = argmax^^ log P(xj | (p) 

't' i=l 1=1 1=1 

which due to the product nature of the problem it becomes more convenient 

to maximize the log likelihood. We will take a closer look today at the 
ML principle by introducing a key element known as the relative entropy 
measure between distributions. 



2.1 ML and Empirical Distribution 

The ML principle states that the empirical distribution of an i.i.d. sequence 
of examples is the closest possible (in terms of relative entropy which would 
be defined later) to the true distribution. To make this statement clear 
let X he s, set of symbols {ai,...,an} and let P{a \ 6) be the probability 

(belonging to a parametric family with parameter 0) of drawing a symbol 
a £ X. Let xi, Xm be a sequence of symbols drawn i.i.d. according to P. 
The occurrence frequency f{a) measures the number of draws of the symbol 
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a: 

f{a) = \{i : Xi = a}\, 
and let the empirical distribution be defined by 

P(a) = ^—/(a) = TT^fia) = (l/m)/(a). 

The joint probability P{xi, ...,Xm \ 4>) is equal to the product W_iP{xi \ (p) 
which according to the definitions above is equal to: 

m 

p{xi,...,xm I </)) = Hpixi I ^) = n I ^y^"^- 

i=l a&X 

The ML principle is therefore equivalent to the optimization problem: 

max Yl P{a I (/))^(") (2.1) 

where Q = {q G i?" : q > 0, = 1} denote the set of n-dimensional 

probability vectors ("probability simplex"). Let pi stand for P{ai \ (p) and 
fi stand for f{ai). Since Siigm.ax^z{x) = argmax^ Inz(x) and given that 
In Y\- p(' = Y^- fi In Pi the solution to this problem can be found by setting 
the partial derivative of the Lagrangian to zero: 



L{p,Kl^) = ^fi'^T^Pi - A(^Pi - 1) - ^ 



where A is the Lagrange multiplier associated with the equality constraint 

"YiPi — 1 = and Hi > are the Lagrange multipliers associated with the 
inequality constraints pi > 0. We also have the complementary slackness 
condition that sets //j = if > 0. 

After setting the partial derivative with respect to pi to zero we get: 

Pi = fi- 

A + Hi 

Assume for now that fi > for i = l,...,n. Then from complementary 

slackness we must have — ^ (because pi > 0). We are left therefore 
with the result pi = (l/A)/j. Following the constraint ^iPi = 1 we obtain 
A = fi- As a result we obtain: P{a | 0) = P{a). In case /j = we could 
use the convention OlnO = and from continuity arrive to pi = 0. 
We have arrived to the following theorem: 



Theorem 1 The empirical distribution estimate P is the unique Maximum 
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Maximum Likelihood/ Maximum Entropy Duality 



Likelihood estimate of the probability model Q on the occurrence frequency 
/()• 

This seems like an obvious result but it actually runs deep because the result 
holds for a very particular (and non-intuitive at first glance) distance mea- 
sure between non-negative vectors. Let dist{{, p) be some distance measure 
between the two vectors. The result above states that: 

P = argmin dist(f, p) s.t. p > 0, N^pj = 1, (2.2) 
P 

for some (family?) of distance measures distO. It turns out that there 
is only on^ such distance measure, known as the relative-entropy, which 
satisfies the ML result stated above. 



2.2 Relative Entropy 

The relative-entropy (RE) measure D(x||y) between two non-negative vec- 
tors X, y G is defined as: 

n 

-^^(x||y) = V'xiln — - V'xi V'yj. 

'H Vi ^ ^ 

1=1 I I 

In the definition we use the convention that In ^ = and based on con- 
tinuity that Oln^ = and xln^ = oo. When x, y are also probability 
vectors, i.e., belong to Q, then L'(xlly) = X^ja^iln|^ is also known as the 
Kullback-Leibler divergence. The RE measure is not a distance metric as 
it is not symmetric, -D(x||y) 7^ Z)(y||x), and does not satisfy the triangle 
inequality. Nevertheless, it has several interesting properties which make it 
a fundamental measure in statistical inference. 

The relative entropy is always non-negative and is zero if and only if 
X = y. This comes about from the log-sum inequality: 



5:xan^>(5:x.)inp| 



Thus, 

-D(x||y) > (^Xi)ln - ^ ^ yj = Sin 3 - x y 



t not exactly — the picture is a b it more complex. Csiszar's 1972 measures: dist{p, f) 



'^i fi4>{Pi/ fi) will satisfy eqn. 2.2 provided that t/)'"^ is an exponential. However, iijst(f, p) 
(parameters positions are switched) will not do it, whereas the relative entropy will satisfy 
eqn. |2.2|regardless of the order of the parameters p, f. 
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But aln(a/6) > a — 6 for a, 6 > iff ln(a/6) > 1 — {b/a) which fohows from 
the inequahty ln(x + 1) > x/{x + 1) (which holds for x > —1 and x ^ 0). 
We can state the following theorem: 



Theorem 2 Let f > 6e the occurrence frequency on a training sample. 
P G Q is a ML estimate iff 

P = argmin D(f||p) s.t. p > 0, > pi = 1. 
P 

Proof: 

D(f| |p) = - J] /anp, + ^ /an i^, - ^ /, + 1, 



and 



argmin D(f| I p) = argmax "S^ filnpi = argmax InTTpJ 

p p V p 



D 

There are two (related) interesting points to make here. First, from the 
proof of Thm. [T] we observe that the non- negativity constraint p > need 
not be enforced - as long as f > (which holds by definition) the closest p 
to f under the constraint ^^Pi = 1 must come out non-negative. Second, 
the fact that the closest point p to f comes out as a scaling of f (which is by 
definition the empirical distribution P) arises because of the relative-entropy 
measure. For example, if we had used a least-squares distance measure 
||f — pIP the result would not be a scaling of f. In other words, we are 
looking for a projection of the vector f onto the probability simplex, i.e., 
the intersection of the hyperplane x^l = 1 and the non-negative orthant 
X > 0. Under relative-entropy the projection is simply a scaling of f (and 
this is why we do not need to enforce non-negativity). Under least-sqaures, 
a projection onto the hyper-plane x^l = 1 could take us out of the non- 



negative orthant (see Fig. 2.1 for illustration). So, relative-entropy is special 
in that regard — it not only provides the ML estimate, but also simplifies 
the optimization proces^ (something which would be more noticeable when 
we handle a latent class model next lecture). 



2.3 Maximum Entropy and Duality ML/MaxEnt 

The relative-entropy measure is not symmetric thus we expect different out- 
comes of the optimization min^,. D(x||y) compared to m.m.y D{x\\y). The lat- 

f The fact that non-negativity "comes for free" does not apply for all class (distribution) models. 
This point would be refined in the next lecture. 



16 Maximum Likelihood/ Maximum Entropy Duality 




Fig. 2.1. Projection of a non-neagtaive vector f onto the hyperplane J2i Xi — 1 = 0. 

Under rclativc-cntropy the projection P is a scaling of f (and thus lives in the 
probability simplex). Under least-squares the projection p2 lives outside of the 
probability simplex, i.e., could have negative coordinates. 

ter of the two, i.e., miap^Q D{Po\\P) , where Pq is some empirical evidence 
and Q is some model, provides the ML estimation. For example, in the 
next lecture we will consider Q the set of low-rank joint distributions (called 
latent class model) and see how the ML (via relative-entropy minimization) 
solution can be found. 

Let H(p) = — X^jPilnpi denote the entropy function. With regard to 
mhix D{x\\y) we can state the following observation: 

Claim 2 

argmin D(p||— 1) = argmaxi7(p). 
peQ " peQ 

Proof: 

-D(pII^I) = ^Pilnpi + (^j3j)ln(n) = ln(n) - H{p), 
i i 

which follows from the condition Ylii Pi = ^- G 

In other words, the closest distribution to uniform is achieved by maxi- 
mizing the entropy. To make this interesting we need to add constraints. 
Consider a linear constraint on p such as aiPi = 13. To be concrete, con- 
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sider a die with six faces thrown many times and we wish to estimate the 
probabihties pi, .■■,P6 given only the average Yli/'-Pi- Say, the average is 3.5 
which is what one would expect from an unbiased die. The Laplace's prin- 
ciple of insufficient reasoning calls for assuming uniformity unless there is 
additional information (a controversial assumption in some cases). In other 
words, if we have no information except that each pi >0 and that ^^Pi = 1 
we should choose the uniform distribution since we have no reason to choose 
any other distribution. Thus, employing Laplace's principle we would say 
that if the average is 3.5 then the most "likely" distribution is the uniform. 
What if P = 4.2? This kind of problem can be stated as an optimization 
problem: 

maxi?(p) s.t.,y^^pi = I, y^aiPi = (3, 

i i 

where Qj = i and /3 = 4.2. We have now two constraints and with the aid 
of Lagrange multipliers we can arrive to the result: 

Pi = exp-(i-^)exp'^"' . 

Note that because of the exponential Pi > and again "non-negativity 
comes for free"[t| Following the constraint YliPi = 1 "we get exp~*^^~^) = 
1 / exp'^"* from which obtain: 

Pi = ^exp'^"", 

where Z (a function of //) is a normalization factor and ^ needs to be set by 
using f3 (see later). There is nothing special about the uniform distribution, 
thus we could be seeking a probability vector p as close as possible to some 
prior probability pg under the constraints above: 

niinL»(p||po) s.t.,Y^pi = 1, Y^aiPi = 

i i 

with the result: 

Pi = ^POi exp^"' . 

We could also consider adding more linear constraints on p of the form: 
Z^i fijPi = bj, j = 1, k. The result would be: 

Pi = — Poiexp^j=i'"^-"^ . 
Probability distributions of this form are called Gihhs Distributions. In 

t Any measure of the class dist(p, Pq) = ^iPOifpiPi/POi) minimized under linear constraints 
will satisfy the result of > provided that <f>'~^ is an exponential. 
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practical applications the linear constraints on p could arise from average 
information about the system such as temperature of a fluid (where pi are 
the probabilities of the particles moving at various velocities), rainfall data 
or general environmental data (where pi represent the probability of finding 
animal colonies at discrete locations in a 3D map). A constraint of the 
form YlifijPi = states that the expectation Ep[fj] should be equal to 
the empirical distribution P = Ep[fj] where P is either uniform or given as 
input. Let 

V = {peR'' : p > 0, Y^pi = 1, Epifj] = Ep[fj],j = 1, ...,k}, 

i 

and 

Q = {q G i?" ; q is a Gibbs distribution} 

We could therefore consider looking for the ML solution for the parameters 
//I, /Hfc of the Gibbs distribution: 

min D(p\\a), 

where if p is uniform then minD(p||q) can be replaced by max^^ln^j 
(because Z)((l/n)ljjx) = — ln(n) — Yli^^^i)- 

As it turns out, the MaxEnt and ML are duals of each other and the 
intersection of the two sets V D Q contains only a single point which solves 
both problems. 

Theorem 3 The following are equivalent: 

• MaxEnt: q* = an/TOinpgpD(p||po) 

• ML: q* = argminq^^QD {p\\q) 

• €[* eVnQ 

In practice, the duality theorem is used to recover the parameters of the 
Gibbs distribution using the ML route (second line in the theorem above) 
— the algorithm for doing so is known as the iterative scaling algorithm 
(which we will not get into) . 
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In Lecture [2]we saw that the Maximum Likehhood (ML) principle over i.i.d. 
data is achieved by minimizing the relative entropy between a model Q and 
the occurrence- frequency of the training data. Specifically, let x^^, ..,x^ be 
i.i.d. where each Xj G X'^ is a d-tupple of symbols taken from an alphabet X 
having n different letters {ai, On}. Let P be the empirical joint distribu- 
tion, i.e., an array with d dimensions where each axis has n entries, i.e., each 
entry Pi-^^...^i^^ where ij = 1, ...,n, represents the (normalized) co-occurrence 
of the d-tupe ai-^,...,ai^ in the training set xi,...,Xm. We wish to find a 
joint distribution P* (also a d-array) which belongs to some model family 
of distributions Q closest as possible to P in relative-entropy: 

P* = argminD(P||P). 

PeQ 

In this lecture we will focus on a model of distributions Q which represents 
mixtures of simple distributions TC — known as latent class models. A latent 
class model arises when the joint probability P{Xi, ...,Xrf) we observe (i.e., 
from which P is generated by observing samples xi,...,Xm) is in fact a 
marginal of P{Xi, Xii,Y) where y is a "hidden" (or "latent") random 
variable which has k different discrete values ai, ..,afc. Then, 

k 

P{Xi,...,Xd) = Y,PiXu-,Xd \Y = aj)P{Y = aj). 
j=i 

The idea is that given the value of the hidden variable H the problem of 
recovering the model P{Xi, X^ \ Y = aj), which belongs to some family 
of joint distributions TC, is a relatively simple problem. To make this idea 
clearer we consider the following example: Assume we have two coins. The 
first coin has a probability of heads ("0") equal to p and the second coin 
has a probability of heads equal to q. At each trial we choose to toss coin 1 
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with probability A and coin 2 with probabiHty 1 — A. Once a coin has been 
chosen it is tossed 3 times, producing an observation x G {0, 1}^. We are 
given a set of such observations D = {xi, ...,x„j} where eacli observation Xj 
is a triplet of coin tosses (the same coin). Given D, we can construct the 
empirical distribution P which is a 2 x 2 x 2 array defined as: 

Ph,i2,i3 = = {n,i2,«3}, i = 1, ■■■,m}\. 

Let Ui G {1, 2} be a random variable associated with the observation Xj such 
that r/j = 1 if Xj was generated by coin 1 and = 2 if Xj was generated 
by coin 2. If we knew the values of yi then our task would be simply 
to estimate two separate Bernoulli distributions by separating the triplets 
generated from coin 1 from those generated by coin 2. Since is not known, 
we have the marginal: 

P(x = (xi,X2,a;3)) = P(x = (xi,X2,X3) I y = l)P(y = 1) 
+ P(x= (xi,X2,X3) \y = 2)P{y = 2) 
= Ap"Hl + (1 - A)g"Hl - g)(3-«')(3.1) 

where {xi,X2,X3) € {0,1}^ is a triplet coin toss and < < 3 is the 
number of heads ("0" ) in the triplet of tosses. In other words, the likelihood 
P(x) of triplet of tosses x = {xi,X2,X3) is a linear combination ("mixture") 
of two Bernoulli distributions. Let H stand for Bernoulli distributions: 

n 

H = {vi®^ : u > 0, = 1} 

1=1 

where u®^ stands for the outer- product of u G i?" with itself d times, i.e., 
an n- way array indexed by n, where ij G {1, ...,n}, and whose value 
there is equal to Ui^ ■ ■ ■ Ui^. The model family Q is a mixture of Bernoulli 
distributions: 

k 

Q = {Y^^^Pj ■■ A > 0, ^ A, = 1, P, G H}, 
where specifically for our coin-toss example becomes: 

(\ 03 / \ 03 

We see therefore that the eight entries of P* G Q which minimizes D{P\\P) 
over the set Q is determined by three parameters X,p,q. For the coin-toss 
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example this looks like: 



argmin I? P || A 
0<A,p,g<l \ 

1 1 



= argmax 
0<A,p,g<i 




«1=0 12=0 13=0 



where n^^jg = i\ + i2 + Trying to work out an algorithm for minimizing 
the unknown parameters A,p, q would be somewhat "unpleasant" (and even 
more so for other families of distributions 7i) because of the log-over-a-sum 
present in the optimization function — if we could somehow turn this into 
a sum-over-log our task would be much easier. We would then be able to 
turn the problem into a succession of problems over 7i rather than a single 
problem over Q = XjTC. Another point worth attention is the non- 
negativity of the output variables — simply minimizing the relative-entropy 
measure under the constraints of the class model Q would not guarantee a 
non-negative solution. As we shall see, breaking down the problem into a 
successions of problems over H would give us the " non-negativity for free" 
feature. 

The technique for turning the log-over-sum into a sum-over-log as part of 
finding the ML solution for a mixture model is known as the Expectation- 
Maximization (EM) algorithm introduced by Dempster, Laird and Rubin in 
1977. It is based on two ideas: (i) introduce auxiliary variables, and (ii) use 
of Jensen's inequality. 



Let D = {xi, represent the training data where Xj € A' is taken from 

some instance space X which we leave unspecified. For now, we leave matters 
to be as general as possible and specifically we do not make independence 
assumptions on the data generation process. 

The ML problem is to find a setting of parameters 9 which maximizes 
the likelihood -P(xi, ...,Xm | 6), namely, we wish to maximize P{D \ 9) over 
parameters 9, which is equivalent to maximizing the log-likelihood: 



where y represents the hidden variables. We will denote L{9) = log P{D | 9). 



3.1 The EM Algorithm: General 



9* = argmax log P{D | 6*) = log ^ P{D, y 
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Let q{y | D, 6) be some (arbitrary) distribution of the hidden variables y con- 
ditioned on the parameters and the input sample D, i.e., g(y \ D,6) = 
1. We define a lower hound on L(9) as follows: 




>- i:^(ylA<')logf^ (3.4) 

= Q{q,e). (3.5) 

The inequality comes from Jensen's inequality ^og^- a jUj > Z^jajlogaj 
when J2j «j = 1- What we have obtained is an "auxiliary" function Q{q, 9) 
satisfying 

L{9)>Q{q,9), 

for all distributions q^y \ D,9). The maximization of Q{q,9) proceeds by 
interleaving the variables q and as we separately ascend on each set of 
variables. At the {t + 1) iteration we fix the current value of 9 to be 9^^^ 
of the t'th iteration and maximize Q{q,9^^^) over q, and then maximize 
g(g(*+i),e) over 9: 

g(*+i) = argmaxQ(g,0W) (3.6) 

= argmaxg(g(*+i),e). (3.7) 
e 

The strategy of the EM algorithm is to maximize the lower bound Q{q, 9) 
with the hope that if we ascend on the lower bound function we will also 
ascend with respect to L{9). The claim below guarantees that an ascend on 
Q will also generate an ascend on L: 

Claim 3 (Jordan-Bishop) The optimal q{y \ D,9^^^) at each step is 

P{y I D,9(')). 

Proof: We will show that Q{P{y \ D, fl^*)), 6l(*)) = L(6'(*)) which proves the 
claim since L{9) > Q{q, 9) for all q, 9, thus the best g-distribution we can 
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hope to find is one that makes the lower-bound meet L{6) at 9 = 6^^\ 



2_^ny\u,o J log p(y|^^0W) 
= logPCZ) I ^W) J]]P(y I Z),eW) 

y 

D 

The proof provides also the validity for the approach of ascending along 
the lower bound Q{q, 0) because at the point 9^^'^ the two functions coincide, 
i.e., the lower bound function at 9 = 9^^^ is equal to L{9^^^) therefore if 
we continue and ascend along Q(-) we are guaranteed to ascend along L{9) 
as wel|f] — therefore, convergence is guaranteed. It can also be shown (but 
omitted here) that the point of convergence is a stationary point of L{9) (was 
shown originally by C.F. Jeff Wu in 1983 years after EM was introduced in 
1977) under fairly general conditions. The second step of maximizing over 
9 then becomes: 

Q(t+i) ^ argmax V P(y | D, 6l(*)) log P[D, y \ 9). (3.8) 

' y 

This defines the EM algorithm. Often the "Expectation" step is described 
as taking the expectation of: 

%~p(y I DfiW) [logP(P',y I 9)] , 

followed by a Maximization step of finding 9 that maximizes the expectation 
— hence the term EM for this algorithm. 



Eqn. 3.8 describes a principle but not an algorithm because in general. 



without making assumptions on the statistical relationship between the data 



points and the hidden variable the problem presented in eqn. 3.8 is unwieldy. 



We will reduce eqn. 3.8 to something more manageable by making the i.i.d. 



assumption. This is detailed in the following section. 



t this manner of deriving EM was adapted from Jordan and Bishop's book notes, 2001. 
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3.2 EM with i.i.d. Data 

The EM optimization presented in eqn. |3.8| can be simplified if we assume 
the data points (and the hidden variable values) are i.i.d. 

n n 

P{D \9)=ll P(x, I 9), P{D, y I 0) = n H^i^V^ I 
1=1 1=1 

and 

n 

p{y\D,e) = llp{y^\^^,e). 

i=l 

For any a{yi) we have: 

^a{yi)P{y\D,9) = • • • a(y,)P(yi | xi, 0) • • • | x„„ 0) 

y yi Vn 

= ^a{yi)P{yi \ yii^O) 



this is because "^y. P{yj I Xj, 9) = 1. Substituting the simplifications above 
into eqn. |3.8| we obtain: 

k m 

Q{t+i) ^ argmax ^ ^ P{y, = a,- | x^, e^*)) log P(x„ = a, \ 9) (3.9) 

j=l i=l 



where yi £ {qi, ...,0^}. 



3.3 Back to the Coins Example 

We will apply the EM scheme to our running example of mixture of Bernoulli 
distributions. We wish to compute 

Q(0,0W) = ^P{y\D,9^'^)logPiD,y\9) 

y 

n 2 

= EE^(y*=-?' I Xi, ^W) log P(x„y,=i I 9), 

i=l j=l 
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and then maximize QQ with respect to p, g, A. 

n 

Q{e,9') = ^[P{yi = l\^i,e')logP{^i\yi = l,e)P{yi = l\e)] 

i=l 
n 

+ E [^(yi = 2 I xi, e') log P(xi \yi = 2, e)P{yi = 2\e)\ 

i=l 

= [M«log(Ap"Hl-j^)(=^-"')) + (l-//.)log((l-A)(z"Hl-9)^'" 

where ^' stands for and = P{yi = 1 | Xj,0'). The values of /Xj are 
known since 6' = {Xo,Po,Qo) Sive given from the previous iteration. The 
Bayes formula is used to compute /if. 

= Pfa,l|...,.),^(''.li>. = M-)Pte. = lK) 



\oPo'{l - Po)(3-"«) + (1 - Ao)g?'(l - (7o)(3-"i) 

We wish to compute: max^^^^;^ 6''). The partial derivative with respect 
to A is: 

from which we obtain the update formula of A given /Xj: 

1 " 

i=l 

The partial derivative with respect to p is: 



dQ _sr^ fJ-irij _ Mi (3 - rii) 
2^ p ^ 1 



0, 



i i 

from which we obtain the update formula: 

n, 

Likewise the update rule for q is: 

To conclude, we start with some initial "guess" of the values of p, q, A, com- 
pute the values of and update iteratively the values of p, q, A where at the 
end of each iteration the new values of //j are computed. 
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3.4 Gaussian Mixture 

The Gaussian mixture model assumes that i-'(x) where x £ R"^ is a hnear 
combination of Gaussian distributions 



P(x)=^P(x| y = j)P(y = j) 



where 



-P(x \ y = j) = cL 



1 



l|X-C,.||2 

3 



is Normally distributed with mean Cj and covariance matrix a'ji. Let D = 
{xi,...,Xm} be the i.i.d sample data and we wish to solve for the mean 
and covariances of the individual Gaussians (the " factors" ) and the mixing 
coefficients \j = P{y = j). In order to make clear where the parameters are 
located we will write P(x | instead of P(x | y = j) where (pj = (cj,a|) 
are the mean and variance of the j'th factor. We denote by 9 the collection 
of mixing coefficients Xj and cpj, j = 1, ...,k. Let wj be auxiliary variables 
per point Xi and per factor y = j standing for: 

= P{yi = j I Xi,6i). 



The EM step (eqn. 3.9) is: 



k m 

Q{t+i) ^ argmax V V wf^'^ log (AjP(xi | 0^)) s.t. V Xj = 1. (3.10) 

Note the constraint Aj = 1. The update formula for wl is done through 
the use of Bayes formula: 

_ P(|/.=j|^W)P(x. U.=j,gW) _ ^yit)p. I At). 

where Zi is a scaling factor so that wl = 1. 

The update formula for Xj , Cj , aj follow by taking partial derivatives of 



eqn. (3.10) and setting them to zero. Taking partial derivatives with respect 
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to Aj, Cj and Gj we obtain the update rules: 

1=1 

^ in 

Ei< i=i 

2 -'- 7 11 ||2 

(J A = T > wi\\yii — Cj\\ . 

In other words, the observations Xj are weighted by wl before a Gaussian is 
fitted {k times, one for each factor). 

3.5 Application Examples 

3.5.1 Gaussian Mixture and Clustering 

The Gaussian mixture model is classically used for clustering applications. 
In a clustering application one receives a sample of points xi, ...,x^ where 
each point resides in R^. The task of the learner (in this case "unsupervised" 
learning) is to group the m points into k sets. Let j/j G {1, ...,A;} where 
i = l,...,m stands for the required labeling. The clustering solution is an 
assignment of values to yi, ...,ym according to some clustering criteria. 

In the Gaussian mixture model points are clustered together if they arise 
from the same Gaussian distribution. The EM algorithm provides a proba- 
bilistic assignment P{yi = j \ Xi) which we denoted above as wf. 

3.5.2 Multinomial Mixture and "bag of words" Application 

The multinomial mixture (the coins example we toyed with) is typically used 
for representing "count" data, such as when representing text documents as 
high-dimensional vectors. A vector representation of a text document asso- 
ciates a word from a fixed vocabulary to a coordinate entry of the vector. 
The value of the entry represents the number of times that particular word 
appeared in the document. If we ignore the order in which the words ap- 
peared and count only their frequency, a set of documents di, ...,dm and a 
set of words wi,....,Wn could be jointly represented by a co-occurence nxm 
matrix G where Gij contains the number of times word Wi appeared in doc- 
ument dj. If we scale G such that '^ ■j Gij = 1 then we have a distribution 
P{w, d). This kind of representation of a set of documents is called "bag of 
words" . 
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For purposes of search and filtering it is desired to reveal additional infor- 
mation about words and documents such as to which "topic"' a document 
belongs to or to which topics a word is associated with. This is similar to 
a clustering task where documents associated with the same topic are to be 
clustered together. This can be achieved by considering the topics as the 
value of a latent variable y: 

P{w,d) = Y,P{w,d I y)P{y) =J2p{w\ y)P{d \ y)P{y), 
y y 

where we made the assumption that wA-d \ y (i.e., words and documents are 
conditionally independent given the topic). The conditional independent 
assumption gives rise to the multinomial mixture model. To be more specific, 
ley y G {1, .... A;} denote the k possible topics and let \j = P{y = j) (note 
that Xj = 1), then the latent class model becomes: 

k 

P{w, d) = Y^ \jP{w I y = j)P{d \y = j). 

Note that P{w | y = j) is a vector which we denote as uj G R"^ and P(d \ y = 
j) is also a vector we denote by Vj G K^. The term P{w \ y = j)P{d \ y = j) 
stands for the outer-product u^vj of the two vectors, i.e., is a rank-1 nxm 
matrix. The Maximum-Likelihood estimation problem is therefore to find 
vectors ui, ...,Ufc and vi, v^ and scalars Ai, Afc such that the empirical 
distribution represented by the unit scaled matrix G is as close as possible 
(in relative-entropy measure) to the low-rank matrix Y2j ^j^j^J subject to 
the constraints of non-negativity and Xj = 1, Uj and Vj are unit-scaled 
as well {I'^Uj = I'^Vj = 1). 

Let Xi = {w{i),d{i)) stand for the i'th example i = l,...,q where an 
example is a pair of word and document where w{i) G {1, n} is the index 
to the word alphabet and d{i) G {!,..., m} is the index to the document. 
The EM algorithm involves the following optimization step: 

q k 

i=l j=l 

g k 

= argmax ^ ^ w^f log [XjUj^w{i)Vj,d{i)] s.t. I'^A = l^Uj = I'^Vj = 1 
^ i=i j=i 

An update rule for ujr (the r'th entry of uj) is derived below: the derivative 
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of the Lagrangian is: 



d 



dUjr 



du 



N{r)logUjr ^ 

w{i)=i 



W, 







where N{r) stands for the frequency of the word Wr in ah the documents 
di, dm- Note that N{r) is the result of summing-up the r'th row of G and 
that the vector A^(l), A^(n) is the marginal P{w) = X^d -^('"^' '^)- Given 
the constraint l^u, = 1 we obtain the update rule: 



iV(r)E 



it) 



w{i)=r ^ij 



'w{i)=s ij 

Update rules for the remaining unknowns are similarly derived. Once EM 



has converged, then J2w{i)=r 
to the j'th topic and Ed(i)= 
comes from the j'th topic. 



w*j is the probability of the word Wr to belong 
g w*j is the probabihty that the s'th document 
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In this lecture we begin the exploration of the 2-class hyperplane separation 
problem. We are given a training set of instances Xj G i?", i = l,...,m, 
and class labels yi = ±1 (i.e., the training set is made up of "positive" 
and "negative" examples). We wish to find a hyperplane direction w G i?" 
and an offset scalar b such that w • Xj — 6 > for positive examples and 
w • Xj — 6 < for negative examples — which together means that the 
margins yj(w ■ Xj — 6) > are positive. 

Assuming that such a hyperplane exists, clearly it is not unique. We 
therefore need to introduce another constraint so that we could find the 
most "sensible" solution among all (infinitley many) possible hyperplanes 
which separate the training data. Another issue is that the framework is 
very limited in the sense that for most real-world classification problems 
it is somewhat unlikely that there would exist a linear separating function 
to begin with. We therefore need to find a way to extend the framework 
to include non-linear decision boundaries at a reasonable cost. These two 
issues will be the focus of this lecture. 

Regarding the first issue, since there is more than one separating hyper- 
plane (assuming the training data is linearly separable) then the question 
we need to ask ourselves is among all those solutions which of them has the 
best "generalization" properties? In other words, our goal in constructing 
a learning machine is not necessarily to do very well (or perfect) on the 
training data, because the training data is merely a sample of the instance 
space, and not necessarily a "representative" sample — it is simply a sample. 
Therefore, doing well on the sample (the training data) does not necessarily 
guarantee (or even imply) that we will do well on the entire instance space. 
The goal of constructing a learning machine is to maximize the performance 
on the test data (the instances we haven't seen), which in turn means that 
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we wish to generalize "good" classification performance on the training set 
onto the entire instance space. 

A related issue to generalization is that the distribution used to generate 
the training data is unknown. Unlike the statistical inference material we 
had so far, this time we will not attempt to estimate the distribution. The 
reason one can derive optimal learning algorithms yet bypass the need for 
estimating distributions would be explained later in the course when PAC- 
learning will be introduced. For now we will focus only on the algorithmic 
aspect of the learning problem. 

The idea is to consider a subset C-y of all hyperplanes which have a fixed 
margin 7 where the margin is defined as the distance of the closest training 
point to the hyperplane: 

7 = mm < — > . 

» I l|w|| J 

The Support Vector Machine (SVM), first introduced by Vapnik and his 
colleagues in 1992, seeks a separating hyperplane which simultaneously min- 
imizes the empirical error and maximizes the margin. The idea of maximiz- 
ing the margin is intuitively appealing because a decision boundary which 
lies close to some of the training instances is less likely to generalize well 
because the learning machine will be susceptible to small perturbations of 
those instance vectors. A formal motivation for this approach is deferred to 
the PAC-learning material we will introduce later in the course. 



4.1 Large Margin Classifier as a Quadratic Linear Programming 

We would first like to set up the linear separating hyperplane as an optimiza- 
tion problem which is both consistent with the training data and maximizes 
the margin induce by the separating hyperplane over all possible consistent 
hyperplanes. 

Formally speaking, the distance between a point x and the hyperplane is 
defined by 

I w • X — 6 I 
• w 

Since we are allowed to scale the parameters w, b at will (note that if w • 
x — 6 > so is (Aw) • X — (Aft) > for all A > 0) we can set the distance 
between the boundary points to the hyperplane to be by scaling 

w, b such the point(s) with smallest margin (closest to the hyperplane) will 
be normalized: | w-x — 6 |= 1, therefore the margin is simply 2/-y/w • w (see 
Fig. 5.1). Note that argmax-^2/-y/w • w is equivalent to argmax-^2/(w • w) 
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which in turn is equivalent to argmin^^w • w. Since all positive points 

and negative points should be farther away from the boundary points we 
also have the separability constraints w ■ x — b > 1 when x is a positive 
instance and w-x — 6 < —1 when x is a negative instance. Both separability 
constraints can be combined: y{w • x — 6) > 1. Taken together, we have 
defined the following optimization problem: 

1 

mm -w • w 

W,6 2 

subject to 

yi(w • Xj - 6) - 1 > i = l,...,m 

This type of optimization problem has a quadratic criteria function and 
linear inequalities and is known in the literature as a Quadratic Linear Pro- 
gramming (QP) type of problem. 

This particular QP, however, requires that the training data are linearly 
separable — a condition which may be unrealistic. We can relax this condi- 
tion by introducing the concept of a "soft margin" in which the separability 
holds approximately with some error: 



(4.1) 

(4.2) 



min • w + z/ ei (4.3) 

1=1 

subject to 

yi{w ■■Xi-b) >1 - Ci i = l,...,m 
ei>0 

Where v is some pre-defined weighting factor. The (non-negative) vari- 
ables ei allow data points to be miss- classified thereby creating an approx- 
imate separation. Specifically, if Xj is a positive instance (y^ = 1) then the 
"soft" constraint becomes: 

w • Xj - 6 > 1 - ej, 

where if ej = we are back to the original constraint where Xj is either a 

boundary point or laying further away in the half space assigned to positive 
instances. When > the point Xj can reside inside the margin or even in 
the half space assigned to negative instances. Likewise, if Xj is a negative 
instance {yi = — 1) then the soft constraint becomes: 

w • X,: - 6 < -1 -I- e,;. 
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I wl 




Fig. 4.1. Separating hyperplanc w, 6 with maximal margin. The boundary points 
are associated with non-vanishing Lagrange multiphers fii > and margin errors 
are associated with > where the criteria function encourages a small number 
of margin errors. 



The criterion function penalizes (the Li-norm) for non- vanishing a thus the 
overall system will seek a solution with few as possible "margin errors" (see 
Fig. 5.1 1 . Typically, when possible, an Li norm is preferable as the L2 norm 
overly weighs high magnitude outliers which in some cases can dominate 
the energy function. Another note to make here is that strictly speaking 
the "right thing" to do is to penalize the margin errors based on the Lq 
norm ||e||Q = \{i : > 0}|, i.e., the number of non-zero entries, and drop 
the balancing parameter u. This is because it does not matter how far away 
a point is from the hyperplane — all what matters is whether a point is 
classified correctly or not (see the definition of empirical error in Lecture 4) . 
The problem with that is that the optimization problem would no longer be 
convex and non-convex problems are notoriously difficult to solve. Moreover, 
the class of convex optimization problems (as the one described in Eqn. 4.3 1 
can be solved in polynomial time complexity. 

So far we have described the problem formulation which when solved 
would provide a solution with "sensible" generalization properties. Although 
we can proceed using an off-the-shelf QLP solver, we will first pursue the 
"dual" problem. The dual form will highlight some key properties of the 
approach and will enable us to extend the framework to handle non-linear 
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decision surfaces at a very little cost. In the appendix we take a brief tour on 

the basic principles associated with constrained optimization, the Karush- 
Kuhn- Tucker (KKT) theorem and the dual form. Those are recommended 
to read before moving to the next section. 

4.2 The Support Vector Machine 

We return now to the primal problem (eqn. 6.3) representing the maximal 
margin separating hyperplane with margin errors: 



We will now derive the Lagrangian Dual of this problem. By doing so a 
new key property will emerge facilitated by the fact that the criteria func- 
tion (note there are no equality constraints thus there is no need for A) 
involves only inner-products of the training instance vectors Xj. This prop- 
erty will form the key of mapping the original input space of dimension n to 
a higher dimensional space thereby allowing for non-linear decision surfaces 
for separating the training data. 

Note that with this particular problem the strong duality conditions are 
satisfied because the criteria function and the inequality constraints form a 
convex set. The Lagrangian takes the following form: 



mm 




i=l 



subject to 

yi(w • Xj - 6) > 1 - ej i = l,..., 

ei>0 



m 



m m 



m 



L(w, 




W ■ Xj - 



6)-l + ei]-^<5, 



1=1 1=1 



i=l 



Recall that 



9{ijl) = min L(w, 6, e, yit, 5). 

W,6,e 



Since the minimum is obtained at the vanishing partial derivatives of the 
Lagrangian with respect to w, 5, the next step would be to evaluate those 
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constraints and substitute them back into L() to obtain 

dL 



dL 
db 

dL 

dei 



i 







(4.4) 
(4.5) 
(4.6) 



From the first constraint (4.4) we obtain w = ^^/ijyjXj, that is, w is de- 
scribed by a hnear combination of a subset of the training instances. The 
reason that not all instances participate in the linear superposition is due 
to the KKT conditions: = when yi(w • Xj — &) > 1, i.e., the instance Xj 
is classified correctly and is not a boundary point, and conversely, /Uj > 
when yi(w • Xj — 5) = 1 — ej, i.e., when Xj is a boundary point or when 
Xj is a margin error (ej > 0) — note that for a margin error instance the 
value of ej would be the smallest possible required to reach an equality in 
the constraint because the criteria function penalizes large values of ej . The 
boundary points (and the margin errors) are called support vectors thus w is 
defined by the support vectors only. The third constraint (4.6) is equivalent 
to the constraint: 



< /ii < 



1, 



J, 



since 5i > 0. Also note that if ej > 0, i.e., point Xj is a margin-error point, 
then by KKT conditions we must have 5i = 0. As a result jjn = u. Therefore 
based on the values of /ij alone we can make the following classifications: 

• < fii < u: point Xj is on the margin and is not a margin-error. 

• fii = I': points Xj is a margin-error point. 

• fii = 0: point Xj is not on the margin. 

Substituting these results/constraints back into the Lagrangian L() we 
obtain the dual problem: 



max 
/ii,...,/i„ 



m ^ 
^(A*) =^l^i~ nJ2 l^il^jyiVi^i " 



(4.7) 



subject to 
< /Uj < 

m 



, m 



The criterion function 0(fj,) can be written in a more compact manner as 
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follows: Let M be a Z x Z matrix whose entries are Mij = yiyjXi ■ Xj then 
9{iJ,) = /jJl — ^ijJ Mfi where 1 is the vector of (1, 1) and is the vector 
{fii, fim) and ij7 is the transpose (row vector). Note that M is positive 
definite, i.e., Mx. > for all vectors x / — a property which will be 
important later. 

The key feature of the dual problem is not so much that it is simpler 

than the primal (in fact it isn't since the primal has no equality constraints) 
or that it has a more "elegant" feel, the key feature is that the problem 
is completely described by the inner products of the training instances Xj, 
i = 1, m. This fact will be shown to be a crucial ingredient in the so called 
"kernel trick" for the computation of inner-products in high dimensional 
spaces using simple functions defined on pairs of training instances. 

4.3 The Kernel Trick 

We ended with the dual formulation of the SVM problem and noticed that 
the input data vectors Xj are represented by the Gram matrix M. In other 
words, only inner-products of the input vectors play a role in the dual for- 
mulation — there is no explicit use of or any other function of Xj besides 
inner-products. This observation suggests the use of what is known as the 
"kernel trick" to replace the inner-products by non-linear functions. 

The common principle of kernel methods is to construct nonlinear vari- 
ants of linear algorithms by substituting inner-products by nonlinear kernel 
functions. Under certain conditions this process can be interpreted as map- 
ping of the original measurement vectors (so called "input space") onto 
some higher dimensional space (possibly infinitely high) commonly referred 
to as the "feature space". Mathematically, the kernel approach is defined 
as follows: let xi,...,x/ be vectors in the input space, say i?", and con- 
sider a mapping (/>(x) : i?" T where T is an inner-product space. The 
kernel-trick is to calculate the inner-product in T using a kernel function 
k : i?" X i?" — > R, k{'Ki,:x.j) = (f){-Ki)~^ (p{-Kj) , while avoiding explicit mappings 
(evaluation of) (f){). 

Common choices of kernel selection include the; d'th order polynomial 
kernels /c(xj,Xj) = (xjxj -|- 6)'^ and the Gaussian RBF kernels A;(xj,Xj) = 
exp(— 23^11x1 — Xjlp). If an algorithm can be restated such that the input 
vectors appear in terms of inner-products only, one can substitute the inner- 
products by such a kernel function. The resulting kernel algorithm can be 
interpreted as running the original algorithm on the space J- of mapped 
objects (/>(x). 

We know that M of the dual form is positive semi-definite because M 
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can be written is M = Q^Q where Q = [j/iXi, ...,y;X;]. Therefore x^Mx = 
IIQxp > for all choices of x (which means that the eigenvalues of M are 
non-negative). If the entries of M arc to be replaced with yiyjk(xi,Xj) then 
the condition we must enforce on the function k{) is that it is a positive 
definite kernel function. A positive definite function is defined such that 
for any set of vectors xi, ...,Xq and for any values of q the matrix K whose 
entries are Kij = /c(xj,Xj) is positive semi-definite. Formally, the conditions 
for admissible kernels fe() are known as Mercer's conditions summarized 
below: 

Theorem 4 (Mercer's Conditions) Let k{x, y) he symmetric and contin- 
uous. The following conditions are equivalent: 

(i) k{x,y) = X^^i ai(j)i{x)(pi{y) = (p{x)^ (p{y) for any uniformly converg- 
ing series > 0. 

(ii) for all ipQ satisfying j^'4P'{x)dx < oo, then 

k{x, y)^{x)ip{y)dxdy > 

(iii) for all {xj}?^-,^ and for all q, the matrix K^j = k{xi,Xj) is positive 
semi-definite. 

Perhaps the non-obvious condition is No. 1 which allows for the feature 
map 4>{) to have infinitely many coordinates (a vector in Hilbert space). For 
example, as we shall see below, the kernel exp(— 2^||xj — Xjp) is an inner- 
product of two vectors with infinitely many coordinates. We will consider 
next a number of popular kernels. 




4.3.1 The Homogeneous Polynomial Kernel 

Let X, y G and define fc(x, y) = (x^y)'^ where li > is a natural number. 
Then, the corresponding feature map ^(x) has {^~^'^~^) = 0{k'^) coordinates 
which take the value: 



m, ...,nk 



where (,^^^ = dl/{nil ■ ■ ■ n^l) is the multinomial coefficient (number of 

ways to distribute d balls into k bins where the j'th bin hold exactly uj > 
balls): 



ixi + ... + Xk)'= Yl f Vi 
„^ni^„._. \ni,...,nkj 
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The dimension of the vector space 0(x) where x G i?'^ can be measured 
using the following combinatorial problem: how many arrangements of /c — 1 
partitions to be placed among d items? the answer is (^j^^^^) = ~ 
0{k'^). For example, k = d = 2 gives us : 

(x^y)^ = xjyl + 2xiX2yiy2 + a;iy| = (?!)(x)'^<^(y), 

where ^(x) = (rcf , x^, \/2a;ia;2). 



4 •3.2 The non-homogeneous Polynomial Kernel 

The feature map 0(x) contains all monomials whose power is lesser or equal 
to d, i.e., X^jHi < d. This can be acheived by increasing the dimension 
to A; + 1 where n^+i is used to fill the gap between Yli=i < d and d. 
Therefore the dimension of <^(x) where x G -R'^ would be i^^*^)- We have: 

{^''y + ef = {xiyi + ... + Xkyk + ^eVet 

\ni,...,nk+ij 

Therefore, the entries of the vector (^(x) take the values: 



</,(x)=fjf ^ )x1^---xl'-d^^+^A 



For example, k = d = 2 gives us : 

(x^y + 9)^ = xjyf + 2xiX2yiy2 + xjyj + 20xm + 2^x2^2 + = <^(x)^<^(y), 

where ^(x) = (xf , x^, V2xiX2, V29xi,V29x2, VO). In this example, is a 
mapping from to and hyperplanes (?!)(w)^^(x)— 6 = in i?^ correspond 
to conies in B?: 

{wl)x\ + {wl)x2 + {2wiW2)xiX2 + {2ewi)xi + {2eW2)X2 + (61 - 6) = 

Assume we would like to find a separating conic (Parabola, Hyperbola, 
Ellipse) function rather than a line in i?^. The discussion so far suggests we 
construct the Gram matrix M in the dual form with the d = 2 polynomial 
kernel fc(x, y) = (x'''y + ^)^ for some parameter 6 of our choosing. The extra 
effort we will need to invest is negligible — simply replace every occurrence 

x7x,- with (xTx, +^)2. 
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4.3.3 The RBF Kernel 

The function A;(x,y) = e^H"^^^"^/^'^^ known as a Radial Basis Function 
(RBF) is a kernel function but with an infinite expansion. Without loss of 
generality let o" = 1, then we have: 

g-||X-yl|2/2 ^ g-||X|l2/2g-||yf /2gXTy 

(x y)^ ^-||x||V2^-||yHV2 




e 2j / j 



From which we can see that the entries of the feature map (/'(x) are: 



^ / • \ 1/2 



4.3.4 Classifying New Instances 

By adopting some kernel k{) we are in fact mapping x 0(x), thus we 
then proceed to solve for (/)(w) and b using some QLP solver. The QLP 
solution of the dual form will yield the solution for the Lagrange multipliers 



//I, ..-iHm- We saw from eqn. (4.4) that we can express (/)(w) as a function 
of the (mapped) examples: 

'/'(w) = ^/iiyi(/'(xi). 

i 

Rather than explicitly representing 0(w) — a task which may be prohibitly 
expensive since in general the dimension of the feature space of a polynomial 
mapping is — we store all the support vectors (those input vectors 

with corresponding > 0) and use them for the evaluation of test examples: 

/(x) = sic/n(</>(w)^0(x) -b) = sign(^ ^iiyi(l){yii)^ (t){yi) - b) 

i 

= signC^ ij,iyik{xi,x) - b). 
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We see that the kernel trick enabled us to look for a non-linear separating 
surface by making an implicit mapping of the input space onto a higher di- 
mensional feature space using the same dual form of the SVM formulation — 
the only change required was in the way the Gram matrix was constructed. 
The price paid for this convenience is to carry all the support vectors at the 
time of classification /(x). 

A couple of notes may be worthwhile at this point. The constant b can 
be recovered from any of the support vectors. Say, is a positive support 
vector (but not a margin error, i.e., /ij < u). Then (p{w)~^ (^{'k'^) —6=1 
from which b can be recovered. The second note is that the number of 
support vectors is typically around 10% of the number of training examples 
(empirically). Thus the computational load during evaluation of /(x) may 
be relatively high. Approximations have been proposed in the literature by 
looking for a reduced number of support vectors (not necessarily aligned 
with the training set) — but this is beyond the scope of this course. 

The kernel trick gained its popularity with the introduction of the SVM 
but since then has taken a life of its own and has been applied to principal 
component analysis (PCA), ridge regression, canonical correlation analysis 
(CCA), QR factorization and the list goes on. We will meet again with the 
kernel trick later on. 
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In this lecture (and the following one) we will focus on spectral methods for 
learning. Today we will focus on dimensionality reduction using Principle 
Component Analysis (PCA), multi-class learning using Linear Discriminant 
Analysis (LDA) and Canonical Correlation Analysis (CCA). In the next 
lecture we will focus on spectral clustering methods. 

Dimensionality reduction appears when the dimension of the input vector 
is very large (imagine pixels in an image, for example) while the coordi- 
nate measurements are highly inter-dependent (again, imagine the redun- 
dancy present among neighboring pixels in an image). High dimensional 
data impose computational efficiency challenges and often translate to poor 
generalization abilities of the learning engine (see lectures on PAC). A di- 
mensionality reduction can also be viewed as a feature extraction process 
where one takes as input a large feature set (the original measurements) 
and creates from them a much smaller number of new features which are 
then fed into the learning engine. 

In this lecture we will focus on feature extraction from a very specific (and 
constrained) stanpoint. We would be looking for a mixing (linear combina- 
tion) of the input coordinates such that we obtain a linear projection from 
i?" to R'^ for some q < n. In doing so we wish to reduce the redundancy 
while preserving as much as possible the variance of the data. Prom a sta- 
tistical standpoint this is achieved by transforming to a new set of variables, 
called principal components, which are un correlated so that the first few 
retain most of the variation present in all of the original coordinates. For 
example, in an image processing application the input images are highly re- 
dundant where neighboring pixel values are highly correlated. The purpose 
of feature extraction would be to transform the input image into a vector of 
output components with the least redundancy possible. Form a geometric 
standpoint, this is achieved by finding the "closest" (in least squares sense) 
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linear g-dimensional susbspace to the m sample points S. The new sub- 
space is a lower dimensional "best approximation" to the sample S. These 
two, equivalent, perspectives on data compression (dimensionality reduc- 
tion) form the central idea of principal component analysis (PCA) which 
probably the oldest (going back to Pearson 1901) and best known of the 
techniques of multivariate analysis in statistics. The computation of PCA 
is very simple and the definition is straightforward, but has a wide variety 
of different applications, a number of different derivations, quite a number 
of different terminologies (especially outside the statistical literature) and is 
the basis for quite a number of variations on the basic technique. 

We then extend the variance preserving approach for data representation 
for labeled data sets. We will describe the linear classifier approach (sepa- 
rating hyperplane) form the point of view of looking for a hyperplane such 
that when the data is projected onto it the separation is maximized (the dis- 
tance between the class means is maximal) and the data within each class is 
compact (the variance/spread is minimized). The solution is also produced, 
just like PCA, by a spectral analysis of the data. This approach goes under 
the name of Fisher's Linear Discriminant Analysis (LDA). 

What is common between PCA and LDA is (i) the use of spectral ma- 
trix analysis — i.e., what can you do with eigenvalues and eigenvectors of 
matrices representing subspaces of the data? (ii) these techniques produce 
optimal results for normally distributed data and are very easy to imple- 
ment. There is a large variety of uses of spectral analysis in statistical and 
learning literature including spectral clustering, Multi Dimensional Scaling 
(MDS) and data modeling in general. Another point to note is that this is 
the first time in the course where the type of data distribution plays a role 
in the analysis — the two techniques are defined for any distribution but 
are optimal only under the Gaussian distribution. 

We will also describe a non-linear extension of PCA known as Kernel- 
PCA, but the focus would be mostly on PCA itself and its analysis from 
a couple of vantage points: (i) PCA as an optimal reconstruction after a 
dimension reduction, i.e., data compression, and (ii) PCA for redundancy 
reduction (decorrelation) of the output components. 



5.1 PCA: Statistical Perspective 

Let xi,...,x^ G i?" be our sample data S of vectors in K^, arranged as 

columns of a matrix A. It will be convenient to assume that the data is 
centered, i.e., = 0. If the data is not centered we can always center it 

by computing the mean vector = (1/m) Xj and replace the original data 
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sample with the new sample x, — jjL. In a statistical sense, the coordinates 
of the vector X € i?" arc considered as random variables, thus a row in the 
matrix A is the sample of values of a particular random variable, drawn from 
some unknown probability distribution, associated with the row position. 
We wish to find vectors Ui,...,Uq (arranged as columns of a matrix [/), 
where q < min{n,m), such that the new feature measurements y = C/^x 
(who are the result of linear combinations uJJ'x, ujx of the original feature 
measurements x) have certain desirable properties. 

The idea property to seek from the new coordinates y is statistical inde- 
pendence, i.e., P{yi, .., yq) = P{yi) ■ ■ -Piyq) which would mean that we have 
removed the redundancy of the original data x in the best possible manner. 
This goal, however, is too much to ask from a linear transformation and 
instead we would ask for a weaker property to hold: that the pairwise co- 
variance cov{yi, yj) = vanishes, i.e., that the covariance matrix on the new 
coordinates is diagonal. A diagonal covariance insures some redundancy re- 
moval, but not as good as statistical independence. However, when the data 
is Normally distributed -P(x) ~ N(iJ,, S) with mean /j, and covariance S, then 
the transformation which diagonalizes the covariance matrix also guarantees 
statistical independence. Among all transformations that de-correlate the 
data we will seek the one that maximizes the spread (variance) of the sample 
data after being projected onto the new axes vectors. 

5.1.1 Maximizing the Variance of Output Coordinates 

The property we would like to maximize is that the projection of the sample 
data on the new axes is as spread as possible. To start this analysis, assume 
q = 1, i.e., the n components of the input vector x are reduced to a single 
output component y = u^x. We are looking for a single vector u G .R" 
whose direction maximizes the variance of the output component y. 

Formally, we are looking for a unit vector u which maximizes X^j(u^Xj)^ 
(see Appendix A for basic statistical definitions and note that E[y] = 
because X^jU^Xj = u^(^^Xj) = 0). In other words, the projected points 
onto the axis represented by the vector u are as spread as possible (in a 
least squares sense). In vector notation, the optimization problem takes the 
following form: 

max-||u^A|p subject to -u^u = 1 
u 2" " 2 

The Lagrangian of the problem is: 

L(u, A) = ^u^^^^u - A(^u^u - 1) 
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By taking the partial derivative dL/du = Owe obtain the following necessary 
condition (see Appendix B): 

AA^u = Au, 

which tells us that u is an eigenvector of the nxn (symmetric and positive 
definite) matrix AA~^ . There are n eigenvectors associated with AA~^ and 
we can easily convince ourselves that we arc looking for the one associated 
with the maximal eigenvalue: substitute Au instead of AA^u in the criterion 
function u^^^^u to obtain A(u^u) = A and since the eigenvalues must be 
positive (since AA~^ is positive definite), then the optimum is obtained for 
the maximal eigenvalue. The leading eigenvector u of AA~^ is called the first 
principal axis of the data sample represented by the columns of the matrix 
A, and y = u^x is called the first principal component of the data sample. 

For convenience, we denote Ui = u and Ai = A as the leading eigenvector 
and eigenvalue of AA^. Next, we look for 7/2 = ujx which is uncorrelated 
with yi = ujx. and which has maximum variance (and so on for U3, u^). 
Two random variables are uncorrelated if their covariance vanishes. By 
definition of covariance (see Appendix A) we obtain: 

Coviym) = ^(u7xi)(ujxj) = u7(^xix7)u2 

i i 

= uJAA^U2 = U2 ^^^ui = Aiu7u2 = 

We can therefore use the condition u7u2 = to specify zero correlation 
between yi,y2- The functional to be optimized becomes: 

max-||ujA|p subject to -uju2 = 1, U]'^U2 = 0, 
U2 2 2 

with the Lagrangian being: 

L(u2, A, S) = ^uJAA'^U2 - A(^u^U2 - 1) - (5u7u2. 

By taking the partial derivative with respect to U2 we obtain the necessary 
condition: 

AA^U2 - Au2 - Sui = 0. 

Multiply the equation by ui from the left: 

uj AA^U2 — Auj^U2 — (5u7ui = 0, 

and noting from above that uj AA^U2 = u7u2 = we obtain 6 = 0. As a 
result we obtain: 

AA'^U2 = AU2, 
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so once more we have that A, U2 form an eigenvalue/eigenvector pair of AA^ . 
As before, A should be as large as possible from the remaining spectral de- 
composition. By induction, it can be shown that the remaining principal 
vectors U3, are the decreasing order eigenvactors of AA^ and the vari- 
ance of the z'th principal component yi = ujx is Aj. 

Taken together, the PCA is the solution of the following optimization 
problem: 



It will be useful for later to write the optimization function in a more concise 
manner as follows. Let U be the n x q matrix whose columns are and 
D = diag{Xi, A^) is an gxg diagonal matrbc and Ai > A2 > ... > A^. Then 
from above we have that U^U = I and AA~^U = UD. Using the fact that 
trace (xy""") = x'^y, trace{AB) = trace{BA) and trace{A + B) = trace{A) + 
trace{B) we can convert ||uj^||^ to trace{U^ AAJ U) as follows: 

y^^ul AAJ Uj = ^^trace{AJ\Xi\xl A) = trace{AJ (^Uiul)A) 

i i i 

= trace{A'^UU^A) = trace{U^ AA'^ U) 
Thus, PCA becomes the solution of the following optimization function: 



The solution, as saw above, is that U = [ui, u^] consists of the decreasing 

order eigenvectors of AA~^ . At the optimum, trace{U~^ AAJ U) is equal to 
trace(D) which is equal to the sum of eigenvalues Ai + ... + Xq. 

It is worthwhile noting that when q = n, UU~^ = U~^U = I, and the PCA 
transform is a change of basis in i?" known as Karhunen-Loeve transform. 

To conclude, the PCA transform looks for q orthogonal direction vectors 
(called the principal axes) such that the projection of input sample vectors 
onto the principal directions has the maximal spread, or equivalently that 
the variance of the output coordinates y = C/^x is maximal. The principal 
directions are the leading (with respect to descending eigenvalues) q eigen- 
vectors of the matrix AA"''. When q = n, the principal directions form a 
basis of i?" with the property of de-correlating the data and maximizing the 
variance of the coordinates of the sample input vectors. 




subject to Uj Uj = 1, Uj Uj = 0, i ^ j = 1, q. 



max traceiU 



■^AA" 



U) subject to U^U = I. 



(5.1) 
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5.1.2 Decorrelation: Diagonalization of the Covariance Matrix 

In the previous section we saw that PCA generates a new coordinate system 
y = [/"""x where the coordinates 7/1, of x in the new system are uncorre- 
lated. This means that the covariance matrix over the principle components 
should be diagonal. In this section we will explore this perspective in more 
detail. 

The covariance matrix Xl^,. of the sample data xi, ...,Xm with zero mean 

is 

(l/m)^x,x7 = (l/m)AA^, 

i 

therefore the matrix AA^ we derived above is a scaled version of the co- 
variance of the sample data (see Appendix A). The scale factor 1/m was 
unimportant in the process above because the eigenvectors are of unit norm, 
thus any scale of AA^ would produce the same set of eigenvectors. 

The off-diagonal entries of the covariance matrix represent the corre- 
lation (a measure of statistical dependence) between the i'th and j'th com- 
ponent vectors, i.e., the entries of the input vectors x. The existence of 
correlations among the components (features) of the input signal is a sign 
of redundancy, therefore from the point of view of transforming the input 
representation into one which is less redundant, we would like to find a 
transformation y = [/"""x with an output representation y which is associ- 
ated with a diagonal covariance matrix Sy, i.e., the components of y are 
uncorrelated. 

Formally, Jly = {^/nT^^^y^yj = [1 / m)U"^ AA^ U , therefore we wish to 
find an n X g matrix for which AAJU is diagonal. If in addition, we 
would require that the variance of the output coordinates is maximized, 
i.e., trace{U~^ AAJ U) is maximal (but then we need to constrain the length 
of the column vectors of C/, i.e., set ||uj|| = 1) then we would get a unique 
solution for U where the columns are orthonormal and are defined as the first 
q eigenvectors of the covariance matrix 5] 3;. This is exactly the optimization 



problem defined by eqn. (5.1). 

We see therefore that PCA "decorrelates" the input data. Decorrelation 
and statistical independence are not the same thing. If the coordinates are 
statistically independent then the covariance matrix is diagona^fj but it does 
not follow that uncorrelated variables must be statistically independent — 
covariance is just one measure of dependence. In fact, the covariance is a 
measure of pairwise dependency only. However, it is a fact that uncorrelated 
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variables are statistically independent if they have a multivariate normal 
distribution (a Gaussian). In other words, if the sample data x arc drawn 
from a probability distribution p{x) which has Gaussian form, the PCA 
transforms the sample data into a statistically independent set of variables 
y = C/^x. The details are explained below. 

Recall that a multivariate normal distribution of the random variables 
X = {xi, Xji)'^ is defined as p(x) Ki N{fi, S): 

" (27r)"/2|E!V2'= 

Also recall that a linear combination of the variables produces also a normal 
distribution N{U^ fi,U'^T;U): 

y X 

therefore choose U such that "Sy = U'^'SU is a diagonal matrix = 
diag{ai, a^). We have in that case: 

which can be written as a product of univariate normal distributions Pxi{xi)'- 

which proves the assertion that decorrelated normally distributed variables 
are statistically independent. 



5.2 PCA: Optimal Reconstruction 

A different, yet equivalent, perspective on the PCA transformation is as an 
optimal reconstruction (in a least squares sense) after a dimension reduction. 
We are given a sample data as before xi, x^ and we are looking for a small 
number of orthonormal principal vectors ui,...,Uq where q < min{n,k) 
which define a q-dimensional linear subspacc of i?" which best approximate 
the original input vectors in a least squares sense. In other words, the 
projection x, of the sample points Xj onto the q-dimensional subspace should 
minimize J2i ll^i — Xj|p over all possible q-dimensional subspaces of R^. 

Let U be the subspace spanned by the principal vectors (columns of U) 
and let P be the n x n projection matrix mapping a point x G i?" onto its 
projection x £U. From the definition of projection, the vector x — x must 
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be orthogonal to the subspace U. Let y = (yi, y^) be the coordinates of x 
with respect to the principal vectors, i.e., Uy = x. Then, from orthogonality 
we have that (x — C/y)~''C/w = for all vectors w G Since this is true 
for all w then U~^Uy — C/^x = 0. Therefore, y = {U~^ U)~^U~^ 'x. and as a 
result the projection matrix P becomes: 

p = u{u^u)-^u^, 

satisfying Px = x. In the case the columns of U are orthonormal, U^U = I, 
we have P = UU^ . We are ready now to describe the optimization problem 
on [/: we wish to find an orthonormal set of principal vectors, U~^U = /, 
such that ||xj — J7J7^Xj|p is minimized. 

Note that j^i W^i - UU^x^f = \\A - UU^ Afp where \\Bfp = J^ij^ij 
is the square Frohenious norm of a matrix. The optimal reconstruction 
problem therefore becomes: 

min IIA - [/[/"^Allp subject to U^U = I. 
u 

We will show now that: 

argmin \\A — UU^ AW'j^^ = argmax trace([/'^AA'^C/), 
u u 

which shows that the optimal reconstruction problem is solved by PCA 



(recall Eqn. 5.1 ). 



From the identity ||-B|||, = trace{BB~^), we have: 

\\A - UU^AWl = trace{{A - UU^ A){A - UU^A)'^). 
Expanding the right hand side gives us: 

trace{{A-UU'^A){A-UU^A)~^) = trace{AA~^) - trace{AA^ UU^) 

- trace{UU'^ AA^)+trace{UU^ AA^UU 

The second and third term are equal (commutativity of trace) and is also 
equal to the 4th term due to commutativity of the trace and U"^U = /. 
Taken together: 

\\A - UU'^Afp = trace{AA~^) - trace{U^ AA^ U). 

To conclude, we have proven that by taking the first q eigenvectors of AA~^ 
we obtain a linear subspace which is as close as possible (in a least squares 
sense) to the original sample data. Hence, PCA can be viewed as a vehi- 
cle for optimal reconstruction after dimension reduction. The optimization 
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problem whose solution is the leading q eigenvectors of AA^ is described in 
eqn. |5.1[ 

max traceiU^ AA~^U) subject to U^U = I. 

5.3 The Case n » m 

Consider the situation where n, the dimension of the input vectors, is rela- 
tively large compared to the number of sample vectors m. For example, con- 
sider input vectors representing 50 x 50 sized images of faces, i.e., n = 2500, 
where m = 100. In other words, we are looking for a small number of "face 
templates" (known as "eigenfaces" ) which approximate well the original set 
of 100 face images. In this case, AAJ is very large, 2500 x 2500, yet the num- 
ber of non-vanishing eigenvalues cannot be higher than 100. Given that the 
eigendecomposition process is 0(2500'^), the computational burden would 
be very high. However, it is possible to perform an eigendecomposition on 
AJ A (a 100 X 100 matrix) instead, as shown next. 

Let the columns of Q be the first q < m eigenvectors of A"^ A, i.e., A~^ AQ = 
QD where D is diagonal containing the corresponding eigenvalues. After 
pre-multiplying both sides by A we obtain: 

AA'^iAQ) = {AQ)D, 

from which we conclude that AQ contains the first q eigenvectors (but un- 
normalized) of AA^ . We have therefore that U = AQD~2 because: 

U'^U = D-^Q'^A^AQD-^ = D-^DD-^ = I, 

where we used the fact that A~^ AQ = D. Note that eigenvalues of A~^ A 
and AA~^ are the same (because AA'^ (AQD^^) = {AQD^^)D). 

5.4 Kernel PCA 

We can take the case n» m described in the previous section one step fur- 
ther and consider such large values of n which are practically uncomputable 
— a situation which results when mapping the original input vectors to a 
high dimensional space: (/>(x) where (j) : i?" — > J- for which dim{J-) » n. 
For example, (/>(x) representing the d'th order monomials of the coordinates 
of x, i.e., dim{J^) = {^~^'^~^) which is exponential in d. The mappings 
of interest are those which are paired with a non-linear kernel function: 
A:(x,x') = </.(x)T0(x'). 

Performing PCA on A = [0(xi), 0(xm)] is equivalent to finding the 



50 



Spectral Analysis I: PCA, LDA, CCA 



non-linear surface in i?" (the nature of the non-Hnearity depends on the 
choice of (f>{)) which best approximates tlie original sample data xi, ...,x„j. 
The problem is that AA^ is not computable — however A"^ A is computable 
because {A~^ A)ij = A;(x.j,Xj). 

From the previous section, U = AQD^2 = AV contains the first q eigen- 
vectors of (where Q and D are computable). Since A itself is not 
computable we cannot represent U explicitly, but we can project a new 
vector <p(x) onto the principal directions ui, ...,Ug and obtain the principal 
components, i.e., the output vector y = J7^^(x), as follows. 



t/T0(x) 



V^A-'c 



(x) = 



I ^(xi,x) \ 



V A;(x^,x) / 



Given the principal components (entries of y = U'^ (^{yi) of i^(x)) we can 
measure, for example, the distance between ^(x) and the projection ^(x) = 
UU"^ (f){yi) = Uy onto the hnear subspace spanned by ui, (without the 
need to explicitly compute the principal axes u^), as follows. 



||<^(x)-0(x)||2 = 0(x)T<^(x) + 0(x) </,(x) - 20(x)^,^(x) 

= A;(x, x) + y'^U^Uy - 2<t>{yif {UU^ ^{^)) 

= A;(x, x) - y"^y - 2y'^y 

= A;(x,x)-||yf 



5.5 Fisher's LDA: Basic Idea 

We now extend the variance preserving approach for data representation for 
labeled data sets. We will focus on 2-class sets and look for a separating 
hyperplane: 

/(x) = wTx + 6, 

such that X belongs to the first class if /(x) > and x belongs to the second 
class if /(x) < 0. In the statistical literature this type of function is called 
a linear discriminant function. The decision boundary is given by the set 
of points satisfying /(x) = which is a hyperplane. Fisher's (1936) Linear 
Discriminant Analysis (LDA) is a variance preserving approach for finding 
a linear discriminant function. 
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Fig. 5.1. Linear discriminant analysis based on class centers alone is not sufficient. 
Seefcing a projection which maximizes the distance between the projected centers 
will prefer the horizontal axis over the vertical, yet the two classes overlap on the 
horizontal axis. The projected distance along the vertical axis is smaller yet the 
classes are better separated. The conclusion is that the sample variance of the two 
classes must be taken into consideration as well. 

We will then introduce another popular statistical technique called Canon- 
ical Correlation Analysis (CCA) for learning the mapping between input and 
output vectors using the notion "angle" between subspaces. 

What is common in the three techniques PCA, LDA and CCA is the use 
of spectral matrix analysis — i.e., what can you do with eigenvalues and 
eigenvectors of matrices representing subspaces of the data? These tech- 
niques produce optimal results for normally distributed data and are very 
easy to implement. There is a large variety of uses of spectral analysis in 
statistical and learning literature including spectral clustering, Multi Di- 
mensional Scaling (MDS) and data modeling in general. 

To appreciate the general idea behind Fisher's LDA consider Fig. |5.1[ Let 
the centers of classes one and two be denoted by /xi and ^2 respectively. A 
linear discriminant function is a projection onto a ID subspace such that 
the classes would be separated the most in the ID subspace. The obvious 
first step in this kind of analysis is to make sure that the projected centers 
/Ui,/i2 would be separated as much as possible. We can easily see that the 
direction of the ID subspace should be proportional to — fi2 as follows: 



The right-hand term is maximized when w ~ ^ui — /i2- As illustrated in 
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Fig. 5.1 this type of consideration is not sufficient to capture separability 
in the projected subspace because the spread (variance) of the data points 
around their centers also play an important role. For example, the horizontal 
axis in the figure separates the centers better than the vertical axis but on 
the other hand does a worse job in separating the classes themselves because 
of the way the data points are spread around their centers. The argument 
in favor of separating the centers would work if the data points were living 
in a hyper-sphere around the centers, but will not be sufficient otherwise. 

The basic idea behind Fisher's LDA is to consider the sample covariance 
matrix of the individual classes as well as their centers, in the following way. 
The optimal ID projection would that which maximizes the variance of the 
projected centers while minimizes the variance of the projected data points 
of each class separately. Mathematically, this idea can be implemented by 
maximizes the following ratio: 

w sl + si 

where si is the scaled variance of the projected points of the ffist class: 

XiSCi 

and likewise, 

X,GC2 

where x = j^^Xj + b. 

We will now formalize this approach and derive its solution. We will begin 
with a general description of a multiclass problem where the sample data 
points belong to q different classes, and later focus on the case of g = 2. 



5.6 Fisher's LDA: General Derivation 

Let the sample data points S be members of q classes Ci, ...,Cq where the 
number of points belonging to class Ci is denoted by li and the total number 
of the training set is I = k. Let Hj denote the center of class Ci and fi 
denote the center of the complete training set S: 



^ = 7 



Xie5 
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Let Aj be the matrix associated with class Cj whose columns consists of the 
mean shifted data points: 

Aj = [xi - jjLj, - Hj] Xj e Cj. 

Then, j-AjAj is the covariance matrix associated with class Cj. Let Sw 
(where " w" stands for "within") be the sum of the class covariance matrices: 

T 



Prom the discussion in the previous section, it is p^w^S'^w which we 
wish to minimize. To see why this is so, note 



Let B be the matrix holding the class centers: 

B= [fli - H,...,Hq- jj], 

and let 5';, = ^BB^ (where "b" stands for "between"). From the discussion 
above it is p^^pW^iS^w = Yliif^i ~ A)^ which we wish to maximize. Taken 
together, we wish to maximize the ratio (called "Rayleigh's quotient"): 

max J(w) = — =p-; . 

The necessary condition for optimality is: 

dJ _ S\,'w{mv^ Sw^^) — 5^vif(w'''S'(,w) _ Q 
dw (w'''S'u,w)2 ' 

Prom which we obtain the generalized eigensystem: 

^ftw = J(w)5^w. (5.2) 

That is, w is the leading eigenvector of S^^Si, (assuming Syj is invertible). 
The general case of finding q such axes involves finding the leading general- 
ized eigenvectors of {Sh, S^) — the derivation is out of scope of this lecture. 
Note that since S~^Sh is not symmetric there may be no real- value solution, 
which is a complication will not pursue further in this course. Instead we 
will focus now on the 2-class {q = 2) setting below. 
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5.7 Fisher's LDA: 2-class 

The general derivation is simplified when there are only two classes. The 
covariance matrix BB^ becomes a rank-1 matrix: 

BB^ = {fli - - fl)^ + (/X2 - ^)(^2 - m)^ = (/^i - ^2)(mi - ^2)^- 

As a result, BB^w is a vector in direction ^1 — ^2- Therefore, the solution 



for w from eqn. 5.2 



is: 

The decision boundary w^(x — /i) = becomes: 

x^S~\fii - /i2) - ^{fJ-i + /i2)^5'^^(w - /^2) = 0. (5.3) 

This decision boundary will surface again in the course when we consider 
Bayseian inference. It will be shown that this decision boundary is the 
Maximum Likelihood solution in the case where the two classes are normally 
distributed with means 1^1,^2 and with the same covariance matrix Syj. 



5.8 LDA versus SVM 

Both LDA and SVM search for a so called "optimal" linear discriminant 
function, what is the difference? The heart of the matter lies in the definition 
of what constitutes a sufficient compact representation of the data. In LDA 
the assumption is that each class can be represented by its mean vector and 
its spread (i.e., covariance matrix). This is true for normally distributed 
data — but not true in general. This means that we should expect that 
LDA will produce the optimal discriminant linear function when each of the 
classes are normally distributed. 

With SVM, on the other hand, there is no assumption on how the data 
is distributed. Instead, the emerging result is that the data is represented 
by the subset of data points which lie on the boundary between the two 
classes (the so called support vectors). Rather than making a parametric 
assumption on how the data can be captured (i.e., mean and covariance) the 
theory shows that the data can be captured by a special subset of points. The 
tools, as a result, are naturally more complex (quadratic linear programming 
versus spectral matrix analysis) — but the advantage is that optimality is 
guaranteed without making assumptions on the distribution of the data (i.e., 
distribution free). It can be shown that SVM and LDA would produce the 
same result if the class data is normally distributed. 
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5.9 Canonical Correlation Analysis 

CCA is a technique for learning a mapping /(x) = y where x G i?'^ and 
y e W using the notion of subspace similarity (an extension of the inner 
product between two vectors) from a training set of (xj , yj , i = 1 , . . . , n. Such 
a mapping, where y can be any point in as opposed to a discrete set of 
labels, is often referred to as a "regression" (as opposed to "classification"). 

Like in PCA and LDA, the approach would be to look for projection axes 
such that the projection of the input and output vectors on those axes satisfy 
certain requirements — and like PCA and LDA the tools we would be using 
is matrix spectral analysis. 

It will be convenient to stack our vectors as rows of an input matrix A and 
output matrix B. Let ^4 be an n x matrix whose rows are x^^, ...,x^ and 
B is the n X s matrix whose rows are yj ,...,y^. Consider vectors u G i?*^ 
and V G -R^ and project the input and output data onto them producing 
Au = (x^u, ...,xju) and B^r. The requirement we would like to place on 
the projection axes is that Au ~ i?v, or in other words that (Au)^ {B^^) 
is maximal. The requirement therefore is that the projection of the input 
points onto the u axis is similar to the projection of the output points 
onto the v axis. If wc extend this notion to multiple axes Ui,...,Ug (not 
necessarily orthogonal) and vi,...,Vg where q < min(A;,,s) our requirement 
becomes that the new coordinates of the input points projected onto the 
subspace spanned by the u vectors are similar to the new coordinates of 
the output points projected onto the subspace spanned by the v vectors. 
In other words, wc wish to find two (/-dimensional subspaces one of and 
the other of i?* such that the two sets of projected points are as aligned as 
possible. 

CCA goes a step further and makes the assumption that the input /output 
relationship is solely determined by the relation (angles) between the column 
spaces of A, B. In other words, the particular columns of A are not really 
important, what is important is the space Ua spanned by the columns. 
Since g = Au is a point in Ua (a linear combination of the columns of 
A) and h = Bv is a point in Ub, then g^h is the cosine angle, cos{(j)) 
between the two axes provided that we normalize the vectors g and h. If 
we continue this line of reasoning recursively, we obtain a set of angles 
< 9i < ... < 9q < (7r/2), called "principal angles", between the two 
subspaces uniquely defined as: 



cosiOA = max max g h 



(5.4) 
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subject to: 

gTg = hTh = l, h^h, = 0,g^g, = 0, i = l,...,j-l 

As a result, we obtain the following optimization function over axes u, v: 

maxu'^^'^Sv s.t. lUull^ = 1, \\Bvf = 1. 
u,v 

To solve this problem we first perform a "QR" factorization of A and B. A 
"QR" factorization of a matrix A is a Grahm- Schmidt process resulting in 
an orthonormal set of vectors arranged as the columns of a matrix Qa whose 
column space is equal to the column space of A, and a matrix Ra which 
contains the coefficients of the linear combination of the columns of Qa such 
that A = QaRa- Since orthoganilzation is not unique, the Grahm-Schmidt 
process perfroms the orthogonalization such that Ra is an upper-diagonal 
matrix. Likewise let B = QbRb- Because the column spaces of A and Qa 
are the same, then for every u there exists a u such that Au = Qa^- Our 
optimization problem now becomes: 

maxii^Q^QB'v s.t. ||u|p = 1, ||v|p = 1. 
u,v 

The solution of this problem is when u and v are the leading singular vectors 
of Q\Qb- The singular value decomposition (SVD) of any matrix E is a 
decomposition E = UDV^ where the columns of U are the leading eigen- 
vectors of EE'^ , the rows of are the leading eigenvectors of E'^ E and 
D is a diagonal matrix whose entries are the corresponding square eigen- 
values (note that the eigenvalues of EE'^ and E'^ E are the same). The 
SVD decomposition has the property that if we keep only the first q leading 
eigenvectors then UDV^ is the closest (in least squares sense) rank q matrix 
to E. 

Therefore, let UDV'^ be the SVD of QJQb using the first q eigenvectors. 
Then, our sought after axes U = [ui,...,Ug] is simply R^U and likewise 
and the axes V = [vi, Vg] is equal to R~^V . The axes are called "canon- 
ical vectors", and the vectors gj = (mutually orthogonal) are called 
"variatcs". The concept of principal angles is due to Jordan in 1875, where 
Hotclling in 1936 is the first to introduce the recursive definition above. 

Given a new vector x G i?'^ the resulting vector y can be found by solving 
the linear system [/^x = F^y (since our assumption is that in the new basis 
the coordinates of x and y are similar). 

To conclude, the relationship between A and B is captured by creating 
similar variates, i.e., creating subspaces of dimension q such that the projec- 
tions of the input vectors and the output vectors have similar coordinates. 
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The process for obtaining the two q'-dimensional subspaces is by performing 

a QR factorization of A and B followed by an SVD. Here again the spectral 
analysis of the input and output data matrices plays a pivoting role in the 
input / output association. 



6 
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In the previous lecture we ended up with the formulation: 

max trace{G^KG) s.t. G = I (6.1) 

and showed the solution G is the leading eigenvectors of the symmetric pos- 
itive semi definite matrix K. When K = AA^ (sample covariance matrix) 
with A = [xi,...,Xm], Xj G i?", those eigenvectors form a basis to a k- 
dimensional subspace of i?" which is the closest (in L2 norm sense) to the 
sample points Xj. The axes (called principal axes) gi,...,gfc preserve the 
variance of the original data in the sense that the projection of the data 
points on the gi has maximum variance, projection on g2 has the maximum 
variance over all vectors orthogonal to g^, etc. The spectral decomposition 
of the sample covariance matrix is a way to " compress" the data by means 
of linear super-position of the original coordinates y = G^x. 
We also ended with a ratio formulation: 

w'''S'iw 

max — 

W w ' D2W 

where ^i, 5*2 where scatter matrices defined such that w^S'iw is the variance 
of class centers (which we wish to maximize) and w^S'2W is the sum of 
within class variance (which we want to minimize). The solution w is the 
generalized eigenvector 5*1 w = AS'2W with maximal A. 

In this lecture we will show additional applications where the search for 
leading eigenvectors plays a pivotal part of the solution. So far we have 
seen how spectral analysis relates to PCA and LDA and today we will fo- 
cus on the classic Data Clustering problem of partitioning a set of points 
xi,...,Xm into k >2 classes, i.e., generating as output indicator variables 
2/1, Um where yi G {1, A;}. We will begin with "K-means" algorithm for 
clustering and then move on to show how the optimization criteria relates 
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to grapth-theoretic approaches (like Min-Cut, Ratio-Cut, Normalized Cuts) 
and spectral decomposition. 



6.1 K-means Algorithm for Clustering 

The K-means formulation (originally introduced by assumes that the 
clusters are defined by the distance of the points to their class centers only. 
In other words, the goal of clustering is to find those k mean vectors ci, 
and provide the cluster assignment Ui G {1, ...,k} of each point Xj in the set. 
The K-means algorithm is based on an interleaving approach where the 
cluster assignments yi are established given the centers and the centers are 
computed given the assignments. The optimization criterion is as follows: 

k 

min >^ >^ llx,' — c, |P (6.2) 
S/l,...,y„,Ci,...,Cfe^ " * ^" ^ ' 

j=i yi=j 

Assume that ci, are given from the previous iteration, then 

yi = argmin ||xi - Cj|p, 
j 

and next assume that yi, ..,ym (cluster assignments) are given, then for any 
set S C { 1 , . . . , m} we have that 

^ I^Xj = argmin ^ ||xj - cf. 

In other words, given the estimated centers in the current round, the new 
assignments are computed by the closest center to each point Xj, and then 
given the updated assignments the new centers are estimated by taking the 
mean of each cluster. Since each step is guaranteed to reduce the optimiza- 
tion energy the process must converge — to some local optimum. 

The drawback of the K-means algorithm is that the quality of the local 
optimum strongly depends on the initial guess (either the centers or the 
assignments). If we start with a wild guess for the centers it would be fairly 
unlikely that the process would converge to a good local minimum (i.e. one 
that is close to the global optimum). An alternative approach would be to 
define an approximate but simpler problem which has a closed form solution 
(such as obtained by computing eigenvectors of some matrix). The global 
optimum of the K-means is an NP-Complete problem (mentioned briefly in 
the next section). 

Next, we will rewrite the K-means optimization criterion in matrix form 



and see that it relates to the spectral formulation (eqn. 6.1) 
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6.1.1 Matrix Formulation of K-means 



We rewrite eqn. 6.2 as follows [7]. Instead of carrying the class variables yi 
we define class sets where V'i C {!,..., n} with UV'i = 

and ipi^^ipj =0. The K-means optimization criterion seeks for the centers 
and the class sets: 

k 

min > y llxj — cJP. 
V'i,...,Vfc,Ci,...,Cfe^ ^ " 

j=i i&^j 

Let = IV'jl and following the expansion of the squared norm and dropping 
xJ^Xj we end up with an equivalent problem: 



k k 



min > ; ljc]cj - 2 ^ ^ x/c^ 



Next we substitute Cj with its definition: X^jg^. and obtain a new 

equivalent formulation where the centers Cj are eliminated form considera- 
tion: 

k ^ 

min - — xjxs 

which is more conveniently written as a maximization problem: 

k ^ 

Since the resulting formulation involves only inner-products we could have 



replaced Xj with 0(xj) in eqn. 6.2 where the mapping 0(-) is chosen such 



that 0(xj)'^(/)(xj) can be replaced by some non-linear function K(xj,Xj) — 
known as the "kernel trick" (discussed in previous lectures). Having the 
ability to map the input vectors onto some high- dimensional space before 
K-means is applied provides more flexibility and increases our chances of 
getting out a "good" clustering from the global K-means solution (again, 
the local optimum depends on the initial conditions so it could be "bad"). 
The RBF kernel is quite popular in this context K(xi,Xj) = e"""'^'""'^^" 
with a some pre-determined parameter. Note that fi;(xj,Xj) G (0,1] which 
can be interpreted loosely as the probability of Xj and Xj to be clustered 
together. 

Let Kij = K(xj,Xj) making K a m x m symmetric positive-semi-definite 
matrix often referred to as the "affinity" matrix. Let F be an n x n matrix 
whose entries are Fij = if £ '<pr for some class and Fij = 
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otherwise. In other words, if we sort the points Xj according to cluster 
membership, then F is a block diagonal matrix with blocks Fi, ...,Fk where 
Fr = {l/lr)'i-'L^ is an Ir X Ij. block of I's scaled by l/lr- Then, Eqn. 6.3 can 
be written in terms of K as follows: 

n 

max ^ KijFij = trace{KF) (6.4) 

In order to form this as an optimization problem we need to represent the 
structure of F in terms of constraints. Let G be an n x fc column-scaled 
indicator matrix: Gij = (l/y^) if i E ipj (i.e., Xj belongs to the j'th class) 
and Gij = otherwise. Let gj^, be the columns of G and it can be easily 
verified that g^Sr^ = diag{0, .., Fr,0, ..,0) therefore F = J^iSiSJ = GG^ . 



Since trace{AB) = trace{BA) we can now write eqn. 6.4 in terms of G: 



m.axtrace{G KG) 

under conditions on G which we need to further spell out. 

We will start with the necessary conditions. Clearly G > (has non- 
negative entries). Because each point belongs to exactly one cluster we must 
have G'^Gij = when i / j and G'^Gu = {l/k)l'^l = 1, thus G^G = I. 
Furthermore we have that the rows and columns oi F = GG^ sum up to 
1, i.e., Fl = 1,F~^1 = 1 which means that F is doubly stochastic which 
translates to the constraint GG^l = 1 on G. We have therefore three 
necessary conditions on G: (i) G > 0, (ii) G^G = /, and (iii) GG^l = 1. 
The claim below asserts that these are also sufficient conditions: 

Claim 4 The feasibility set of matrices G which satisfy the three conditions 
G > 0, GG'^l = 1 andG'^G = I are of the form: 

I otherwise J 

Proof: From G > and gjg^ = we have that GirGis = 0, i.e., G 
has a single non-vanishing element in each row. It will be convenient to 
assume that the points are sorted according to the class membership, thus 
the columns of G have the non-vanishing entries in consecutive order and 
let Ij be the number of non- vanishing entries in column g^. Let Uj the 
vector of entries holding only the non- vanishing entries of gj. Then, 
the doubly stochastic constraint GG^l = 1 results that (l^Uj)uj = 1 for 
J = l,...,/c. Multiplying 1 from both sides yields (l~''uj)^ = 1^1 = Ij, 
therefore = [] 
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This completes the equivalence between the matrix formulation: 

max trace{G^KG) s.t. G > 0, G^G = I, GG^l = 1 (6.5) 



and the original K-means formulation of eqn. 6.2 

We have obtained the same optimization criteria as eqn. 6.1 with addi- 
tional two constraints: G should be non-negative and GG^ should be doubly 
stochastic. The constraint G^G = I comes from the requirement that each 
point is assigned to one class only. The doubly stochastic constraint comes 
from a "class balancing" requirement which we will expand on below. 



6.2 Min-Cut 



We will arrive to eqn. 6.5 from a graph-theoretic perspective. We start 
with representing the graph Min-Cut problem in matrix form, as follows. 
A convenient way to represent the data to be clustered is by an undirected 
graph with edge- weights where ^ = {1, m} is the vertex set, E G V x V 
is the edge set and k : E ^ is the positive weight function. Vertices 
of the graph correspond to data points Xj, edges represent neighborhood 
relationships, and edge-weights represent the similarity (affinity) between 
pairs of linked vertices. The weight adjacency matrix K holds the weights 
where Kij = for (i,j) G E and Kij = otherwise. 

A cut in the graph is defined between two disjoint sets A, B C V , A U 
B = V, is the sum of edge-weights connecting the two sets: cut{A, B) = 
X^jgyl which is a measure of dissimilarity between the two sets. The 

Min-Cut problem is to find a minimal weight cut in the graph (can be solved 
in polynomial time through Max Network Flow solution). The following 
claim associates algebraic conditions on G with an indicator matrix: 

Claim 5 The feasibility set of matrices G which satisfy the three conditions 
G > 0, Gl = 1 and G^ G = D for some diagonal matrix D are of the form: 



Gi 



1 Xi^ Ipj 

'■^ 10 otherwise 



Proof: Let G = [g^, g/^.]. From G > and gjgg = we have that 
GirGis = 0, i.e., G has a single non-vanishing element in each row. From 
Gl = 1 the single non-vanishing entry of each row must have the value of 

In the case of two classes (A; = 2), the function tr{G^ KG) is equal to 
Z](ij)eVi ^'■i + ^{i,j)(iip2 ^v- Therefore maxGtr{G'^ KG) is equivalent to 
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minimizing the cut: Ylieipi je-ti>2 ^^j' ^ result, the Min-Cut problem is 
equivalent to solving the optimization problem: 

max triG^KG) s.t G > 0, Gl = 1, G~^ G = diag (6.6) 

We seem to be close to eqn. |6.5| with the difference that G is orthogonal 
(instead of orthonormal) and the doubly-stochasitc constraint is replaced 
by Gl = 1. The difference can be bridged by considering a "balancing" 
requirement. Min-Cut can produce an unbalanced partition where one set 
of vertices is very large and the other contains a spurious set of vertices 
having a small number of edges to the larger set. This is an undesirable 
outcome in the context of clustering. Consider a "balancing" constraint 
G^l = {m/k)l which makes a strict requirement that all the k clusters have 
an equal number of points. We can relax the balancing constraint slightly by 
combining the balancing constraint with Gl = 1 into one single constraint 
GG^l = {m/k)l, i.e., GG^ is scaled doubly stochastic. Note that the two 
conditions GG^ 1 = {m/k)l and G^ G = D result m. D = {m/k)I. Thus we 
propose the relaxed-balanced hard clustering scheme: 

m&^triG^KG) s.t G > 0, GG^l = ^1, G^G = ^/ 
G ~ k k 

The scale m/k is a global scale that can be dropped without affecting the 
resulting solution, thus the Min-Cut with a relaxed balancing requirement 
becomes eqn. |6.5| which we saw is equivalent to K-means: 

maxtr(G^A'G) s.t G > 0, GG^l = 1, G^G = I. 



6.3 Spectral Clustering: Ratio-Cuts and Normalized-Cuts 

We saw above that the doubly-stochastic constraint has to do with a "bal- 
ancing" desire. A further relaxation of the balancing desire is to perform the 
optimization in two steps: (i) replace the affinity matrix K with the closest 
(under some chosen error measure) doubly-stochastic matrix K' , (ii) find a 
solution to the problem: 

max triG'^K'G) s.t G > 0, G^G = / (6.7) 

because GG'^ should come out close to K' {tr{G^K'G) = tr{K'GG^)) 
and K' is doubly-stochastic, then GG^ should come out close to satisfying 
a doubly-stochastic constraint — this is the motivation behind the 2-step 
approach. Moreover, we drop the non-negativity constraint G > 0. Note 
that the non-negativity constraint is crucial for the physical interpretation of 
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G; nevertheless, for k = 2 clusters it is possible to make an interpretation, as 
we shall next. As a result we are left with a spectral decomposition problem 



of eqn. 6.1 



max tr{G^K'G) s.t G'^ G = I, 

where the columns of G are the leading eigenvectors of K'. We will refer 
to the first step as a "normalization" process and there are two popular 
normalizations in the literature — one leading to Ratio-Cuts and the other 
to Normalized- Cuts. 



6.3.1 Ratio- Cuts 

Let D = diag{Kl) which is a diagonal matrix containing the row sums of 
K. The Ratio-Cuts normalization is to look for K' as the closest doubly- 
stochastic matrix to K by minimizing the Li norm — this turns out to be 
K' = K -D + I. 

Claim 6 (ratio-cut) Let K he a symmetric positive-semi-definite whose 
values are in the range [0, 1]. The closest doubly stochastic matrix K' under 
the Li error norm is 

K' = K -D + I 

Proof: Let r = mini;' ||/C-F||i s.t. Fl = 1, F = F^. Since ||i^- F||i > 
\\{K — F)l||i for any matrix F, we must have: 

r > ||(ir-F)l||i = pl-llli = \\D-I\\i. 

Let F = K - D + I, then 

\\K - {K - D + = \\D-I\\i. 

D 

The Laplacian matrix of a graph is D — K. If v is an eigenvector of the 
Laplacian D — K with eigenvalue A, then v is also an eigenvector of K' = 
K — D + I with eigenvalue 1 — A and since {D — K)l = then the smallest 
eigenvector v = 1 of the Laplacian is the largest of K' , and the second 
smallest eigenvector of the Laplacian (the ratio-cut result) corresponds to the 
second largest eigenvector of K' . Because the eigenvectors are orthogonal, 
the second eigenvector must have positive and negative entries (because the 
inner-product with 1 is zero) — thus the sign of the entries of the second 
eigenvector determines the class membership. 
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Ratio-Cuts, the second smallest eigenvector of the Laplacian D — K, is 
an approximation due to Hall in the 70s [2] to the Min-Cut formulation. 
Let z G R"^ determine the class membership such that Xj and xj would be 
clustered together if Zi and zj have similar values. This leads to the following 
optimization problem: 

min - ^^{zi — Zj)'^Kij s.t. z^z = 1 

The criterion function is equal to (l/2)z^ {D — K)z and the derivative of the 
Lagrangian {l/2)z'^ {D — K)z — \{zJ z — 1) with respect to z gives rise to 
the necessary condition {D — K)z = Xz and the Ratio-Cut scheme follows. 



6.3.2 Normalized- Cuts 

Normalized-Cuts looks for the closest doubly-stochastic matrix K' in relative 
entropy error measure defined as: 

RE{x 1 1 y) = In — - 

i i i 

We will encounter the relative entropy measure in more detail later in the 
course. We can show that K' must have the form AKA for some diagonal 
matrix A: 

Claim 7 The closest doubly-stochastic matrix F under the relative-entropy 
error measure to a given non-negative symmetric matrix K, i.e., which min- 
imizes: 

min RE{F\\K) s.t. F>0, F = F'^ , Fl = 1, F^l = 1 

F 

has the form F = AKA for some (unique) diagonal matrix A. 
Proof: The Lagrangian of the problem is: 

The derivative with respect to fij is: 
dL 



In fij + 1 - Inkij - I - \i - /Uj = 



dfij 

from which we obtain: 

fij — 6 ' e^-' kij 
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Let Di = diag{e^^, e'*'") and D2 = diag{e^^, e^"), then we have: 

F = D1KD2 

Since = and K is symmetric we must have D\ = L>2- [] 

Next, we can show that the diagonal matrix A can found by an iterative 
process where K is replaced by D^^/^KD^^^^ where D was defined above 
as diag{Kl): 

Claim 8 For any non-negative symmetric matrix K^'^\ iterating the process 
K^t+i) ^ jj-i/2j^{t)j^-i/2 yj^^i^ ^ diag{K(^h) converges to a doubly 

stochastic matrix. 

The proof is based on showing that the permanent increases monotonically, 
i.e. perm(K^^^^^) > perm{K^^^). Because the permanent is bounded the 
process must converge and if the permanent does not change (at the conver- 
gence point) the resulting matrix must be doubly stochastic. The resulting 
doubly stochastic matrix is the closest to K in relative-entropy. 

Normalized-Cuts takes the result of the first iteration by replacing K 
with K' = D~^^^KD~^^'^ followed by the spectral decomposition (in case 
of fc = 2 classes the partitioning information is found in the second leading 
eigenvector of K' — just like Ratio-Cuts but with a different K'). Thus, K' 
in this manner is not the closest doubly-stochastic matrix to K but is fairly 
close (the first iteration is the dominant one in the process). 

Normalized-Cuts, as the second leading eigenvector of i^T' = D~^^'^KD^^^'^, 
is an approximation to a "balanced" Min-Cut described first in [6j. Deriving 
it from first principles proceeds as follows: 

Let sum{Vi,V2) = sumi^y^^j^y^Kij be defined for any two subsets (not 
necessarily disjoint) of vertices. The normalized-cuts measures the cut cost 
as a fraction of the total edge connections to all the nodes in the graph: 

, , cut(A,B) cut(A,B) 
Ncuts{A,B) = \ ' ; + ^ ' ^ 



sum{A, V) sum(B, V) 

A minimal Ncut partition will no longer favor small isolated points since the 
cut value would most likely be a large percentage of the total connections 
from that small set to all the other vertices. A related measure Nassoc{A, B) 
defined as: 

/ X sumiA.A) sum(B,B) 

Nassoc{A,B) = + 7^^, 

sum{A,V) sum{B,V) 

reflects how tightly on average nodes within the group are connected to each 
other. Given that cut{A, B) = sum{A, V) — sum(A, A) one can easily verify 
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that: 

Ncuts{A, B) = 2- Nassoc{A, B), 

therefore the optimal bi-partition can be represented as maximizing Nassoc{A, V- 
A). The Nassoc naturahy extends to k > 2 classes (partitions) as follows: 
Let ■01 , • • • , V'fc be disjoint sets Ujipj = V, then: 

k 

sum{tpj, V) 

We will now rewrite Nassoc in matrix form and establish equivalence to 
eqn. 



Nassoc{'ifji,...,'ipk) = 



6.7 



LetG = [gi,...,gfc] withg^. = l/^sum{i;j,V){0,...,0,l,...l,0.,,,0) 

with the Is indicating membership to the j'th class. Note that 

gT^g. ^ sum{'>Pj,iPj) 
•' sum{ipj, V) ' 

therefore trace{G~^ KG) = Nassoc{'ipi, ...,ipk)- Note also that gj Dg^ = 
(l/sum(0i,y)) dr = 1, therefore DG = I. Let G = D^I'^G so we 

have that G'^G = / and trace{G'^ D'^l'^KD-^/'^G) = Nassoc{ipi, ...,ipk)- 
Taken together we have that maximizing Nassoc is equivalent to: 

max trace{G'^K'G) s.t. G > 0, G^G = /, (6.8) 
where K' = D-^/'^KD-^/'^. Note that this is exactly the K -means matrix 



setup of eqn. 6.5 where the doubly-stochastic constraint is relaxed into the 
replacement of K by K'. The constraint G > is then dropped and the 
resulting solution for G is the k leading eigenvectors of K' . 



We have arrived via seemingly different paths to eqn. 6.8 which after we 
drop the constraint G > we end up with a closed form solution consisting 
of the k leading eigenvectors of K'. When k = 2 (two classes) one can 
easily verify that the partitioning information is fully contained in the second 
eigenvector. Let vi,V2 be the first leading eigenvectors of K'. Clearly 
V = is an eigenvector with eigenvalue A = 1: 

In fact A = 1 is the largest eigenvalue (left as an exercise) thus vi = D^/^\ > 
0. Since K' is symmetric the vjvi = thus V2 contains positive and 
negative entries — those are interpreted as indicating class membership 
(positive to one class and negative to the other). 

The case A: > 2 is treated as an embedding (also known as Multi-Dimensional 
Scaling) by re-coordinating the points Xj using the rows of G. In other 



68 Spectral Analysis II: Clustering 

words, the i'th row of G is a representation of Xj in R^. Under ideal con- 
ditions where K is block diagonal (the distance between clusters is infinity) 
the rows associated with points clustered together are identical (i.e., the n 
original points are mapped to k points in R^) |5]. In practice, one performs 
the iterative K-means in the embedded space. 
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The Formal (PAC) Learning Model 



We have see so far algorithms that exphcitly estimate the underlying dis- 
tribution of the data (Bayesian methods and EM) and algorithms that are 
in some sense optimal when the underlying distribution is Gaussian (PCA, 
LDA). We have also encountered an algorithm (SVM) that made no as- 
sumptions on the underlying distribution and instead tied the accuracy to 
the margin of the training data. 

In this lecture and in the remainder of the course we will address the 
issue of "accuracy" and "generalization" in a more formal manner. Because 
the learner receives only a finite training sample, the learning function can 
do very well on the training set yet perform badly on new input instances. 
What we would like to establish are certain guarantees on the accuracy of 
the learner measured over all the instance space and not only on the training 
set. We will then use those guarantees to better understand what the large- 
margin principle of SVM is doing in the context of generalization. 

In the remainder of this lecture we will refer to the following notations: 
the class of learning functions is denoted by C. A learning functions is often 
referred to as a "concept" or "hypothesis". A target function G C is a 
function that has zero error on all input instances (such a function may not 
always exist). 



7.1 The Formal Model 

In many learning situations of interest, we would like to assume that the 
learner receives m examples sampled by some fixed (yet unknown) distri- 
bution D and the learner must do its best with the training set in order to 
achieve the accuracy and confidence objectives. The Probably Approximate 
Correct (PAC) model, also known as the "formal model", first introduced by 
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Valient in 1984, provides a probabilistic setting which formalizes the notions 
of accuracy and confidence. 

The PAC model makes the following statistical assumption. We assume 
the learner receives a set 5 of m instances xi, G X which are sampled 
randomly and independently according to a distribution D over X. In other 
words, a random training set S of length m is distributed according to the 
product probability distribution D™. The distribution D is unknown, but 
we will see that one can obtain useful results by simply assuming that D is 
fixed — there is no need to attempt to recover D during the learning process. 
To recap, we make the following three assumptions: (i) D is unkown, (ii) 
D is fixed throughout the learning process, and (iii) the example instances 
are sampled independently of each other (are Identically and Independently 
Distributed — i.i.d.). 

We distinguish between the "realizable" case where a target concept cj(x) 
is known to exist, and the unrealizable case, where there is no such guar- 
antee. In the realizable case our training examples are Z = {(xj, ct(xj)}, 
i = 1, ...,m and D is defined over X (since yi & Y are given by ct(xj)). In 
the unrealizable case, Z = {(xj,yi)} and D is the distribution over X xY 
(each element is a pair, one from X and the other from Y). 

We next define what is meant by the error induced by a concept function 
/i(x). In the realizable case, given a function h E C, the error of h is defined 
with respect to the distribution D: 

err{h) = pro6/j[x : ct(x) ^ ^(x)] = / ind{ct{'x.) ^ /i(x))D(x)dx 

Jxex 

where ind{F) is an indication function which returns '1' if the proposition 
F is true and '0' otherwise. The function err{h) is the probability that 
an instance x sampled according to D will be labeled incorrectly by /i(x). 

Let e > be a parameter given to the learner specifying the "accuracy" of 
the learning process, i.e. we would like to achieve err{h) < e. Note that 
err{ct) = 0. 

In addition, we define a "confidence" parameter 5 > 0, also given to the 
learner, which defines the probability that err{h) > e, namely, 

prob[err{h) > e] < S, 

or equivalently: 

prob[err(h) < e] > 1 — S. 

In other words, the learner is supposed to meet some accuracy criteria but 
is allowed to deviate from it by some small probability. Finally, the learning 
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algorithm is supposed to be "efficient" if the running time is polynomial in 
l/e,ln(l/5),n and the size of the concept target function q() (measured by 
the number of bits necessary for describing it, for example). 

We will say that an algorithm L learns a concept family C in the formal 
sense (PAC learnable) if for any ct £ C and for every distribution D on the 
instance space X, the algorithm L generates efficiently a concept function 
h £ C such that the probability that err{h) < e is at least 1 — 6. 

The inclusion of the confidence value 5 could seem at first unnatural. 
What we desire from the learner is to demonstrate a consistent performance 
regardless of the training sample Z. In other words, it is not enough that 
the learner produces a hypothesis h whose accuracy is above threshold, i.e., 
err{h) < e, for some training sample Z. We would like the accuracy per- 
formance to hold under all training samples (sampled from the distribution 
Z)"*) — since this requirement could be too difficult to satisfy, the formal 
model allows for some "failures", i.e, situations where err{h) > e, for some 
training samples Z, as long as those failures are rare and the frequency of 
their occurrence is controlled (the parameter 6) and can be as small as we 
hke. 

In the unrealizable case, there may be no function h £ C for which 
err(h) = 0, thus we need to define what we mean by the best a learning 
algorithm can achieve: 

Opt{C) = min err(/i), 

which is the best that can be done on the concept class C using functions 
that map between X and Y. Given the desired accuracy e and confidence 6 
values the learner seeks a hypothesis h £ C such that: 

prob[err{h) < Opt{C) + e]>l-6. 

We are ready now to formalize the discussion above and introduce the defi- 
nition of the formal learning model (Anthony &; Bartlett [1], pp. 16): 

Definition 1 (Formal Model) Let C he the concept class of functions that 
map from, a set X to Y. A learning algorithm L is a function: 



L : [J{(x.,y.)}I^i-C 



m=l 



from the set of all training examples to C with the following property: given 
any e,5 £ (0,1) there is an integer mo(e,(5) such that if m > uiq then, for 
any prohahility distrihution D on X x Y , if Z is a training set of length m 
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drawn randomly according to the product probability distribution D'^, then 
with probability of at least 1 — 6 the hypothesis h = L{Z) G C output by 
L is such that err{h) < Opt{C) + e. We say that C is learnable (or PAC 
learnable) if there is a learning algorithm for C. 

There are few points to emphasize. The sample size mo(e, 5) is a suf- 
ficient sample size for PAC learning C by L and is allowed to vary with 
e, 5. Decreasing the value of either e oi 8 makes the learning problem more 
difficult and in turn a larger sample size is required. Note however that 
mo(e, 5) does not depend on the distribution D\ that is, a sufficient sample 
size can be given that will work for any distribution D — provided that 
D is fixed throughout the learning experience (both training and later for 
testing). This point is a crucial property of the formal model because if the 
sufficient sample size is allowed to vary with the distribution D then not 
only we would need to have some information about the distribution in or- 
der to set the sample complexity bounds, but also an adversary (supplying 
the training set) could control the rate of convergence of L to a solution 
(even if that solution can be proven to be optimal) and make it arbitrarily 
slow by suitable choice of D. 

What makes the formal model work in a distribution-invariant manner 
is that it critically depends on the fact that in many interesting learning 
scenarios the concept class C is not too complex. For example, we will show 
later in the lecture that any finite concept class \C\ < oo is learnable, and 
the sample complexity (in the realizable case) is 

e 

In the next lecture we will consider concept classes of infinite size and show 
that despite the fact that the class is infinite it still can be of low complexity! 

Before we illustrate the concepts above with an example, there is another 
useful measure which is the empirical error (also known as the sample error) 
err{h) which is defined as the proportion of examples from Z on which h 
made a mistake: 

err{h) = ■ H^i) + ct(xi)}| 

(replace ct(xj) with for the unrealizable case). The situation of bounding 
the true error err(/i) by minimizing the sample error efr{h) is very conve- 
nient — we will get to that later. 
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As an illustration of learnability we will consider the problem (introduced 
in Kearns & Vazirani [3] ) of learning an axes-aligned rectangle from positive 
and negative examples. We will show that the problem is PAC-learnable 
and find out mo(e, 5). 

In the rectangle learning game we are given a training set consisting of 
points in the 2D plane with a positive '+' or negative '-' label. The positive 
examples are sampled inside the target rectangle (parallel to the main axes) 
R and the negative examples are sampled outside of R. Given m examples 
sampled i.i.d according to some distribution D the learner is supposed to 
generate an approximate rectangle R' which is consistent with the training 
set (we are assuming that R exists) and which satisfies the accuracy and 
confidence constraints. 

We first need to decide on a learning strategy. Since the solution R' 
is not uniquely defined given any training set Z, we need to add further 
constraints to guarantee a unique solution. We will choose R' as the axes- 
aligned concept which gives the tightest fit to the positive examples, i.e., the 
smallest area axes-aligned rectangle which contains the positive examples. 
If no positive examples are given then R' = 9. We can also assume that 
Z contains at least three non-coUinear positive examples in order to avoid 
complications associated with infinitesimal area rectangles. Note that we 
could have chosen other strategies, such as the middle ground between the 
tightest fit to the positive examples and the tightest fit (from below) to the 
negative examples, and so forth. Defining a strategy is necessary for the 
analysis below — the type of strategy is not critical though. 

We next define the error err{R') on the concept R' generated by our 
learning strategy. We first note that with the strategy defined above we 
always have R' <Z R since R' is the tightest fit solution which is consistent 
with the sample data (there could be a positive example outside of R' which 
is not in the training set). We will define the "weight" w{E) of a region E 
in the plane as 



i.e., the probability that a random point sampled according to the distri- 
bution D will fall into the region. Therefore, the error associated with the 
concept R' is 




err(i?') = w{R - R') 
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R' 








R 



Fig. 7.1. Given the tightest- fit to positive examples strategy we have that R' C R. 
The strip Ti has weight e/4 and the strip T{ is defined as the upper strip covering 
the area between R and R' . 



and we wish to bound the error w{R — R') < e with probability of at least 
1 — 5 after seeing m examples. 

We will divide the region R — R' into four strips T[,...,T'^ (see Fig, 7.1 1 
which overlap at the corners. We will estimate prob{w(Tl) > |) noting that 
the overlaps between the regions makes our estimates more pessimistic than 
they truly are (since we are counting the overlapping regions twice) thus 
making us lean towards the conservative side in our estimations. 

Consider the upper strip T[. If tf(T{ < |) then we are done. We are 
however interested in quantifying the probability that this is not the case. 
Assume w{T{) > | and define a strip Ti which starts from the upper axis 
of R and stretches to the extent such that w{Ti) = |. Clearly Ti C T[. We 
have that wiT^) > | iff Ti C T{. Furthermore: 



Claim 9 Ti C r{ iff^i, ...,^rn T^. 



Proof: If Xj G Ti the the label must be positive since Ti C R. But if 
the label is positive then given our learning strategy of fitting the tightest 
rectangle over the positive examples, then Xj G R'. Since Ti {Z! R' it follows 
that Xj ^ Ti. [] 

We have therefore that w{T[ > |) iff no point in Ti appears in the sample 
S = {xi, ...,Xm} (otherwise Ti intersects with R' and thus T{ C Ti). The 
probability that a point sampled according to the distribution D will fall 
outside of Ti is 1 — f • Given the independence assumption (examples are 
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drawn i.i.d.), we have: 

pro6(xi, Ti) = prob{w{T[ > ^)) = (1 - J)™. 

Repeating the same analysis to regions T2 , Tg , T4 and using the union bound 
P(A L) B) < P{A) + P{B) we come to the conclusion that the probability 
that any of the four strips oi R — R' has weight greater that e/4 is at most 
4(1 — 1)"*. In other words, 

prob{err{L') > e) < 4((1 - < 5. 

We can make the expression more convenient for manipulation by using the 
inequality > 1 — x (recall that 1 + (l/n))" < e from which it follows 
that (1 + z)^/^ < e and by taking the power of rz where r > we obtain 
(1 + zY < eF^ then set r = 1, z = — x): 

4(1 - < 46-"? < 5, 

from which we obtain the bound: 

4, 4 
m > - In - . 
e d 

To conclude, assuming that the learner adopts the tightest-fit to positive ex- 
amples strategy and is given at least mo = | In j training examples in order 
to find the axes-aligned rectangle i?', we can assert that with probability 
1 — (5 the error associated with R! (i.e., the probability that an (m -|- l)'th 
point will be classified incorrectly) is at most e. 

We can see form the analysis above that indeed it applies to any distri- 
bution D where the only assumption wc had to make is the independence 
of the draw. Also, the sample size m behaves well in the sense that if one 
desires a higher level of accuracy (smaller e) or a higher level of confidence 
(smaller S) then the sample size grows accordingly. The growth of m is 
linear in 1/e and linear in ln(l/(5). 



7.3 Learnability of Finite Concept Classes 

In the previous section wc illustrated the concept of learnability with a 
particular simple example. We will now focus on applying the learnability 
model to a more general family of learning examples. We will consider the 
family of all learning problems over finite concept classes \C\ < 00. For 

example, the conjunction learning problem (over boolean formulas) with n 
literals contains only 3"" hypotheses because each variable can appear in 
the conjunction or not and if appears it could be negated or not. We have 
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shown that n is the lower bound on the number of mistakes on the worst 
case analysis any on-line algorithm can achieve. With the definitions we 
have above on the formal model of Icarnability we can perform accuracy 
and sample complexity analysis that will apply to any learning problem 
over finite concept classes. This was first introduced by Valiant in 1984. 

In the realizable case over |C| < oo, we will show that any algorithm L 
which returns a hypothesis h C which is consistent with the training set 
Z IS a learning algorithm for C. In other words, any finite concept class 
is learnable and the learning algorithms simply need to generate consistent 
hypotheses. The sample complexity mo associated with the choice of e and 
S can be shown as equal to: Mn 

In the unrealizable case, any algorithm L that generates a hypothesis 
h € C that minimizes the empirical error (the error obtained on Z) is a 
learning algorithm for C. The sample complexity can be shown as equal to: 
4 In We will derive these two cases below. 



7.3.1 The Realizable Case 

Let h e C he some consistent hypothesis with the training set Z (we know 
that such a hypothesis exists, in particular h = ct the target concept used 
for generating Z) and suppose that 

err{h) = prob[x. ~ D : /i(x) 7^ ct(x)] > e. 

Then, the probability (with respect to the product distribution D"^) that h 
agrees with q on a random sample of length m is at most (1 — e)"*. Using 
the inequality we saw before > 1 — x we have: 

prob[err{h) > e kk h{-}^i) = ct(xi), i = 1, m] < (1 - e)™ < e'^"^. 

We wish to bound the error uniformly, i.e., that err{h) < e for all concepts 
h e C. This requires the evaluation of: 

prob[msLX.{err{h) > e} /i(xj) = ct(xj), i = l,...,m\. 



There at most \C\ such functions h, therefore using the Union-Bound the 
probability that some function in C has error larger than e and is consistent 
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with ct on a random sample of length m is at most |C|e~^™: 

prob[3h : err{h) > e && /i(xj) = cj(xj), i = 1, m] 

< ^ pro6[/i(xi) = Q(xi), i = l,...,m] 

h:err(h)>e 

< \h : err(/i) > ele"'™ 

For any positive 5, this probability is less than 8 provided: 

1. |C| 
m > - In — — . 

e 

This derivation can be summarized in the following theorem (Anthony & 
Bartlett p^, pp. 25): 

Theorem 5 Let C he a finite set of functions from X to Y . Let L he 
an algorithm such that for any m and for any a G C, if Z is a training 
sample {(xj, Ci(xj))}, i = l,...,m, then the hypothesis h = L{Z) satisfies 
h(xi) = ct(xj). Then L is a learning algorithm for C in the realizahle case 
with sample complexity 

mo = - In — — . 
e 

7.3.2 The Unrealizable Case 

In the realizable case an algorithm simply needs to generate a consistent 
hypothesize to be considered a learning algorithm in the formal sense. In 
the unrealizable situation (a target function q might not exist) an algorithm 
which minimizes the empirical error, i.e., an algorithm L generates h = L{Z) 
having minimal sample error: 

err{LiZ)) = minerrih) 

is a learning algorithm for C (assuming finite |C|). This is a particularly 
useful property given that the true errors of the functions in C are un- 
known. It seems natural to use the sample errors err{h) as estimates to the 
performance of L. 

The fact that given a large enough sample (training set Z) then the sam- 
ple error err{h) becomes close to the true error err{h) is somewhat of a 
restatement of the "law of large numbers" of probability theory. For ex- 
ample, if we toss a coin many times then the relative frequency of 'heads' 
approaches the true probability of 'head' at a rate determined by the law of 
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large numbers. We can bound the probability that the difference between 
the empirical error and the true error of some h exceeds e using Hoeffding's 
inequality: 

Claim 10 Let h he some function from X to Y = {0, 1}. Then 

prob[\err{h) - err{h)\ > e] < 2e^~'^^^"'\ 

for any probability distribution D, any e > and any positive integer m. 

Proof: This is a straightforward application of Hoeffding's inequality to 
Bernoulli variables. Hoeffding's inequality says: Let X be a set, D a proba- 
bility distribution on X, and /i, /m real- valued functions fi : X ^ [ui, hi] 
from X to an interval on the real line (oj < hi). Then, 



prob 



1 



m 



^/,(x,)-£;x~D[/(x)]|>e 



< 2e ^iV'i-'^ir 



(7.1) 



where 



^x~d[/(x)] = -f; / /^(x)D(x)dx. 



Therefore 



In our case /i(xj) = 1 iff /i(xj) ^ yi and aj = 0,6^ = 
(l/m)X;i/i(xi) = err{h) and err{h) = -Bx~d[/(x)]. |] 

The Hoeffding bound almost does what we need, but not quite so. What 
we have is that for any given hypothesis h G C, the empirical error is 
close to the true error with high probability. Recall that our goal is to 
minimize err{h) over all possible h E C but we can access only err{h). If 
we can guarantee that the two arc close to each other for every h G C, then 
minimizing err{h) over all /i € C will approximately minimize err{h). Put 
formally, in order to ensure that L learns the class C, we must show that 



prob 



max|efr(/i) — erriK)\ < e 
heC ' ^ ^ ^ ^' 



> l-<5 



In other words, we need to show that the empirical errors converge (at high 
probability) to the true errors uniformly over C as m — ^ oo. If that can be 
guaranteed, then with (high) probability 1 — S, for every h E C, 



err{h) — e < err{h) < err{h) + e. 

So, since the algorithm L running on training set Z returns h 
minimizes the empirical error, we have: 



L{Z) which 



err{L{Z)) < err{L{Z)) + e = min efr(/i) + e< Opt{C) + 2e, 

h 
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which is what is needed in order that L learns C. Thus, what is left is to 
prove the following claim: 



Claim 11 



prob 



max \ err(h) — err(h)\ > e 
heC 



< 2\C\e 



Proof: We will use the union bound. Finding the maximum over C is 
equivalent to taking the union of all the events: 



prob 



max I err (/i) — err(h)\ > e 
h&c 



prob 



\^ {Z : \err(h) — err(h)\ > e} 



using the union-bound and Claim 2, we have: 

< ^pro6[|efr(/i) -err(/i)| > e] < \C\2e'^-^'^"'\ 
hec 

D 

Finally, given that 2|C|e '^'^ < 6 we obtain the sample complexity: 

2 2\C\ 
mo = ^In^^. 

This discussion is summarized with the following theorem (Anthony &; Bartlett 
[l],pp. 21): 

Theorem 6 Let C be a finite set of functions from X to Y = {0, 1}. Let L 
be an algorithm such that for any m and for any training set Z = {(xj, yj)}, 
i = 1, ...,m, then the hypothesis L(Z) satisfies: 

err(L(Z)) = min efr(h). 

Then L is a learning algorithm for C with sample complexity mo = ^ In 

Note that the main difference with the realizable case (Theorem 1) is the 
larger rather than 1/e. The realizable case requires a smaller training 
set since we are estimating a random quantity so the smaller the variance 
the less data we need. 
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The result of the PAC model (also known as the "formal" learning model) 
is that if the concept class C is PAC-learnable then the learning strategy 
must simply consist of gathering a sufficiently large training sample S of 
size m, > mo(e,5), for given accuracy e > and confidence < 5 < 1 
parameters, and finds a hypothesis h ^ C which is consistent with S. The 
learning algorithm is then guaranteed to have a bounded error err{h) < e 
with probability 1 — 5. The error measurement includes data not seen by 
the training phase. 

This state of affair also holds (with some slight modifications on the sam- 
ple complexity bounds) when there is no consistent hypothesis (the unreal- 
izable case) . In this case the learner simply needs to minimize the empirical 
error err{h) on the sample training data S, and if m is sufficiently large 
then the learner is guaranteed to have err{h) < Opt{C) + e with probability 
1 — 5. The measure Opt{C) is defined as the minimal err{g) over all g & C. 
Note that in the realizable case Opt{C) = 0. 

The property of bounding the true error err{h) by minimizing the sample 
error err{h) is very convenient. The fundamental question is under what 
conditions this type of generalization property applies? We saw in the previ- 
ous lecture that a satisfactorily answer can be provided when the cardinality 
of the concept space is bounded, i.e. |C| < oo, which happens for Boolean 
concept space for example. In that lecture we have proven that: 

mo(e,(5) = 0(-lnM), 
e 

is sufficient for guaranteeing a learning model in the formal sense, i.e., which 
has the generalization property described above. 

In this lecture and the one that follows we have two goals in mind. First 
is to generalize the result of finite concept class cardinality to infinite car- 
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dinality — note that the bound above is not meaningful when \C\ = oo. 
Can we learn in the formal sense any non-trivial infinite concept class? (we 
already saw an example of a PAC-learnable infinite concept class which is 
the class of axes aligned rectangles). In order to answer this question we will 
need to a general measure of concept class complexity which will replace the 
cardinality term \ C\ in the sample complexity bound nio^e, 6). It is tempting 
to assume that the number of parameters which fully describe the concepts 
of C can serve as such a measure, but we will show that in fact one needs 
a more powerful measure called the Vapnik-Chervonenkis (VC) dimension. 
Our second goal is to pave the way and provide the theoretical foundation 
for the large margin principle algorithm (SVM) we derived in Lecture El 



8.1 The VC Dimension 

The basic principle behind the VC dimension measure is that although C 
may have infinite cardinality, the restriction of the application of concepts 
in C to a finite sample S has a finite outcome. This outcome is typically 
governed by an exponential growth with the size m of the sample S — but 
not always. The point at which the growth stops being exponential is when 
the "complexity" of the concept class C has exhausted itself, in a manner 
of speaking. 

We will assume C is a concept class over the instance space X — both 
of which can be infinite. We also assume that the concept class maps in- 
stances in X to {0, 1}, i.e., the input instances are mapped to "positive" 
or "negative" labels. A training sample S is drawn i.i.d according to some 
fixed but unknown distribution D and S consists of m instances xi, ...,Xm. 
In our notations we will try to reserve c G C to denote the target concept 
and h £ C to denote some concept. We begin with the following definition: 

Definition 2 

nc(S) = {(/i(xi),...,/i(x„) : heC} 
which is a set of vectors in {0, 1}™". 

Ilc{S) is set whose members are m-dimensional Boolean vectors induced by 
functions of C. These members are often called dichotomies or behaviors on 
S induced or realized by C. If C makes a full realization then He (5) will 
have 2"* members. An equivalent description is a collection of subsets of S: 

Uc{S) = {hnS : heC} 

where each h £ C makes a partition of S into two sets — the positive and 
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negative points. The set Iic{S) contains therefore subsets of S (the positive 

points of S under /;,). A full realization will provide (T) ~ ^'"^ 

will use both descriptions of Ilc{S) as a collection of subsets of S and as a 
set of vectors interchangeably. 

Definition 3 // |nc(S')| = 2"^ then S is considered shattered by C. In 
other words, S is shattered by C if C realizes all possible dichotomies of S. 

Consider as an example a finite concept class C = {ci, ...,04} applied to 
three instance vectors with the results: 

Xl X2 X3 



Cl 1 1 1 

C2 1 1 

C3 1 

C4 



Then, 

nc({xi}) = {(0), (1)} shattered 

nc({xi,X3}) = {(0,0),(0,1),(1,0),(1,1)} shattered 

nc({x2,X3}) = {(0,0), (1, 1)} not shattered 

With these definitions we are ready to describe the measure of concept class 
complexity. 

Definition 4 (VC dimension) The VC dimension ofC, noted as VCdim{C), 
is the cardinality d of the largest set S shattered by C. If all sets S (arbi- 
trarily large) can be shattered by C , then VCdim{C) = 00. 

VCdim{C) = max{d | 3\S\ = d, and \Uc{S)\ = 2*^} 

The VC dimension of a class of functions C is the point d at which all 
samples S with cardinality |5| > d are no longer shattered by C. As long as 
C shatters S it manifests its full "richness" in the sense that one can obtain 
from S all possible results (dichotomies). Once that ceases to hold, i.e., 
when |5| > d, it means that C has "exhausted" its richness (complexity). 
An infinite VC dimension means that C maintains full richness for all sample 
sizes. Therefore, the VC dimension is a combinatorial measure of a function 
class complexity. 

Before we consider a number of examples of geometric concept classes 
and their VC dimension, it is important clarify the lower and upper bounds 
(existential and universal quantifiers) in the definition of VC dimension. 
The VC dimension is at least d if there exists some sample |5| = d which is 
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shattered by C — this does not mean that all samples of size d are shattered 
by C. Conversely, in order to show that the VC dimension is at most d, 
one must show that no sample of size d + 1 is shattered. Naturally, proving 
an upper bound is more difficult than proving the lower bound on the VC 
dimension. The following examples are shown in a "hand waiving" style 
and are not meant to form rigorous proofs of the stated bounds — they are 
shown for illustrative purposes only. 

Intervals of the real line: The concept class C is governed by two param- 
eters «!, a2 in the closed interval [0, 1]. A concept from this class will tag an 
input instance < x < 1 as positive if ai < x < 0:2 and negative otherwise. 
The VC dimension is at least 2: select a sample of 2 points xi,X2 positioned 
in the open interval (0, 1). We need to show that there are values of ai, a2 
which realize all the possible four dichotomies (+, +), (— , — ), (+, —),(—,+). 
This is clearly possible as one can place the interval [q;i,q;2] such the inter- 
section with the interval [xi, X2] is null, (thus producing (— , — )), or to fully 
include [xi,X2] (thus producing (-I-, +)) or to partially intersect [.xi,.'r2] such 
that xi or X2 are excluded (thus producing the remaining two dichotomies). 
To show that the VC dimension is at most 2, we need to show that any 
sample of three points xi, X2, X3 on the line (0, 1) cannot be shattered. It is 
sufficient to show that one of the dichotomies is not realizable: the labeling 
(-|-,— ,-|-) cannot be realizable by any interval [q:i,q;2] — this is because if 
xi,X3 are labeled positive then by definition the interval [01,02] must fully 
include the interval [xi,X3] and since xi < X2 < X3 then X2 must be labeled 
positive as well. Thus VCdim{C) = 2. 

Axes-aligned rectangles in the plane: We have seen this concept class 
in the previous lecture — a point in the plane is labeled positive if it lies 
in an axes-aligned rectangle. The concept class C is thus governed by 4 
parameters. The VC dimension is at least 4: consider a configuration of 
4 input points arranged in a cross pattern (recall that we need only to 
show some sample S that can be shattered). We can place the rectangles 
(concepts of the class C) such that all 16 dichotomies can be realized (for 
example, placing the rectangle to include the vertical pair of points and 
exclude the horizontal pair of points would induce the labeling (-I-, — , -|-, — )). 
It is important to note that in this case, not all configurations of 4 points 
can be shattered — but to prove a lower bound it is sufficient to show 
the existence of a single shattered set of 4 points. To show that the VC 
dimension is at most 4, we need to prove that any set of 5 points cannot 
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be shattered. For any set of 5 points there must be some point that is 
"internal", i.e., is neither the extreme left, right, top or bottom point of the 
five. If we label this internal point as negative and the remaining 4 points as 
positive then there is no axes-aligned rectangle (concept) which cold realize 
this labeling (because if the external 4 points are labeled positive then they 
must be fully within the concept rectangle, but then the internal point must 
also be included in the rectangle and thus labeled positive as well). 

Separating hyperplanes: Consider first linear half spaces in the plane. 
The lower bound on the VC dimension is 3 since any three (non-coUinear) 
points in can be shattered, i.e., all 8 possible labelings of the three points 
can be realized by placing a separating line appropriately. By having one 
of the points on one side of the line and the other two on the other side we 
can realize 3 dichotomies and by placing the line such that all three points 
are on the same side will realize the 4th. The remaining 4 dichotomies are 
realized by a sign flip of the four previous cases. To show that the upper 
bound is also 3, we need to show that no set of 4 points can be shattered. 
We consider two cases: (i) the four points form a convex region, i.e., lie on 
the convex hull defined by the 4 points, (ii) three of the 4 points define the 
convex hull and the 4th point is internal. In the first case, the labeling which 
is positive for one diagonal pair and negative to the other pair cannot be 
realized by a separating line. In the second labeling which is positive 

for the three hull points and negative for the interior point cannot be realize. 
Thus, the VC dimension is 3 and in general the VC dimension for separating 
hyperplanes in i?" is n + 1 . 

Union of a finite number of intervals on the line: This is an example 
of a concept class with an infinite VC dimension. For any sample of points 
on the line, one can place a sufficient number of intervals to realize any 
labeling. 

The examples so far were simple enough that one might get the wrong 
impression that there is a correlation between the number of parameters 
required to describe concepts of the class and the VC dimension. As a 
counter example, consider the two parameter concept class: 

C = {sign{sin{u>x + 0) : u} 

which has an infinite VC dimension as one can show that for every set 
of m points on the line one can realize all possible labelings by choosing 
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a sufficiently large value of uj (which serves as the frequency of the sync 
function) and appropriate phase. 

We conclude this section with the following claim: 

Theorem 7 The VC dimension of a finite concept class \C\ < oo is hounded 
from above: 

VCdim{C) <log2|C|. 

Proof: if VCdim{C) = d then there exists at least 2'^ functions in C 
because every function induces a labeling and there are at least 2'^ labelings. 
Thus, ftom |C| > 2'^ follows that d < logs D 

8.2 The Relation betw^een VC dimension and PAC Learning 

We saw that the VC dimension is a combinatorial measure of concept class 
complexity and we would like to have it replace the cardinality term in the 
sample complexity bound. The first result of interest is to show that if 
the VC dimension of the concept class is infinite then the class is not PAC 
learnable. 

Theorem 8 Concept class C with VCdim{C) = oo is not learnable in the 
formal sense. 

Proof: Assume the contrary that C is PAC learnable. Let L be the learning 
algorithm and m be the number of training examples required to learn the 
concept class with accuracy e = 0.1 and 1 — S = 0.9. That is, after seeing 
at least m(e, 6) training examples, the learner generates a concept h which 
satisfies p{err{h) < 0.1) > 0.9. 

Since the VC dimension is infinite there exist a sample set S with 2m 
instances which is shattered by C. Since the formal model (PAC) applies 
to any training sample we will use the set S as follows. We will define a 
probability distribution on the instance space X which is uniform on S (with 
probability ^) and zero everywhere else. 

Because S is shattered, then any target concept is possible so we will 
choose our target concept c in the following manner: 

prob{ct{-Ki) = 0) = ^ Vxi G S, 

in other words, the labels ct(x.j) are determined by a coin flip. The learner 
L selects an i.i.d. sample of m instances S — which due to the structure of 
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D means that the S C S and outputs a consistent hypothesis h e C. The 
probabihty of error for each Xj ^ 5 is: 

prob{ct{xi) / /i(xi)) = ^. 

The reason for that is because S is shattered by C, i.e., we can select any 
target concept for any labeling of S (the 2m examples) therefore we could 
select the labels of the m points not seen by the learner arbitrarily (by 
flipping a coin). Regardless of h, the probability of mistake is 0.5. The 
expectation on the error of h is: 

E\err(h)] = m ■ ■ \- m ■ - ■ — = -. 

^ ^ 2m 2 2m A 

This is because we have 2m points to sample (according to D as all other 
points have zero probability) from which the error on half of them is zero 
(as h is consistent on the training set S) and the error on the remaining half 
is 0.5. Thus, the average error is 0.25. Note that E[err{h)] = 0.25 for any 
choice of e, S as it is based on the sample size m. For any sample size m we 
can follow the construction above and generate the learning problem such 
that if the learner produces a consistent hypothesis the expectation of the 
error will be 0.25. 

The result that E[err{h)] = 0.25 is not possible for the accuracy and 
confidence values we have set: with probability of at least 0.9 we have that 
err{h) < 0.1 and with probability 0.1 then err{h) = (3 where 0.1 < /3 < 1. 
Taking the worst case of /? = 1 we come up with the average error: 

E[err{h)\ < 0.9 • 0.1 + 0.1 • 1 = 0.19 < 0.25. 

We have therefore arrived to a contradiction that C is PAC learnablc. [] 

We next obtain a bound on the growth of |n5(C)| when the sample size 
l^l = m is much larger than the VC dimension VCdim{C) = d oi the 
concept class. We will need few more definitions: 



Definition 5 (Growth function) 

Uc{m) = m8oc{\Us{C)\ : \S\ = m} 

The measure Ilc{Tn) is the maximum number of dichotomies induced by C 

for samples of size rn. As long as m < d then Ilc{m) = 2"*. The question 
is what happens to the growth pattern of Ilc{m) when m > d. Wc will 
see that the growth becomes polynomial — a fact which is crucial for the 
learnability of C. 



8.2 The Relation between VC dimension and PAC Learning 



87 



Definition 6 For any natural numbers m, d we have the following definition: 

$d(m) = ^d{m-l) + ^d-i{m-l) 
$rf(0) = $oM = l 

By induction on m, d it is possible to prove the following: 
Theorem 9 

Proof: by induction on m, d. For details see [P], pp. 56].[] 

For m < d we have that ^diiTi) = 2™. For m > d we can derive a 
polynomial upper bound as follows. 

(lYy M <y (IX M <y (IX M = (^1 + 1^^ 

From which we obtain: 

(-)\d{m)<e''. 
Dividing both sides by (^)'^ yields: 

We need one more result before we are ready to present the main result of 
this lecture: 

Theorem 10 (Sauer's lemma) If VCdim{C) = d, then for any m, 
nc(m) < ^d{m). 

Proof: By induction on both d,m. For details see [[3], pp. 55-56] .[] 

Taken together, we have now a fairly interesting characterization on how 
the combinatorial measure of complexity of the concept class C scales up 
with the sample size m. When the VC dimension of C is infinite the growth 
is exponential, i.e., nc(m) = 2™ for all values of m. On the other hand, 
when the concept class has a bounded VC dimension VCdim{C) = d < oo 
then the growth pattern undergoes a discontinuity from an exponential to 
a polynomial growth: 

f 2™ m<d ) 
\<{^) m>d j 
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As a direct result of this observation, when m >> d is much larger than d 

the entropy becomes much smaller than m. Recall than from an informa- 
tion theoretic perspective, the entropy of a random variable Z with discrete 
values zi, ...,Zn with probabilities pi, i = 1, ...,n is defined as: 



where I{pi) = log2 — is a measure of "information", i.e., is large when 
Pi is small (meaning that there is much information in the occurrence of 
an unlikely event) and vanishes when the event is certain pi = 1. The 
entropy is therefore the expectation of information. Entropy is maximal for 
a uniform distribution H{Z) = log2 n. The entropy in information theory 
context can be viewed as the number of bits required for coding zi, z^- In 
coding theory it can be shown that the entropy of a distribution provides the 
lower bound on the average length of any possible encoding of a uniquely 
decodable code fro which one symbol goes into one symbol. When the 
distribution is uniform we will need the maximal number of bits, i.e., one 
cannot compress the data. In the case of concept class C with VC dimension 
d, we see that one when m < d all possible dichotomies are realized and 
thus one will need m bits (as there are 2"* dichotomies) for representing 
all the outcomes of the sample. However, when m » d only a small 
fraction of the 2™ dichotomies can be realized, therefore the distribution 
of outcomes is highly non-uniform and thus one would need much less bits 
for coding the outcomes of the sample. The technical results which follow 
are therefore a formal way of expressing in a rigorous manner this simple 
truth — If it is possible to compress, then it is possible to learn. The crucial 
point is that learnability is a direct consequence of the "phase transition" 
(from exponential to polynomial) in the growth of the number of dichotomies 
realized by the concept class. 

In the next lecture we will continue to prove the "double sampling" the- 
orem which derives the sample size complexity as a function of the VC 
dimension. 
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The Double-Sampling Theorem 



In this lecture will use the measure of VC dimension, which is a combina- 
torial measure of concept class complexity, to bound the sample size com- 
plexity. 

9.1 A Polynomial Bound on the Sample Size m for PAC Learning 

In this section we will follow the material presented in Kearns & Vazirani 
[3] pp. 57-61 and prove the following: 

Theorem 11 (Double Sampling) Let C be any concept class of VC di- 
mension d. Let L he any algorithm, that when given a set S of m labeled 
examples {xj,c(xj)}i, sampled i.i.d according to some fixed hut unknown 
distribution D over the instance space X, of some concept c £ C, produces 
as output a concept h £ C that is consistent with S. Then L is a learning 
algorithm in the formal sense provided that the sample size obeys: 



for some constant cq > 0. 

The idea behind the proof is to build an " approximate" concept space which 
includes concepts arranged such that the distance between the approximate 
concepts h and the target concept c is at least e — where distance is de- 
fined as the weight of the region in X which is in conflict with the target 
concept. To formalize this story we will need few more definitions. Unless 
specified otherwise, c S C denotes the target concept and h £ C denotes 
some concept. 




log - - log 
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Definition 7 

cAh = hAc = {x : c(x) 7^ /i(x)} 

cAh is the region in instance space where both concepts do not agree — 
the error region. The probabihty that x G cAh is equal to (by definition) 
err{h). 

Definition 8 

A(c) = {hAc : heC} 
Ag(c) = {hAc : heC and err{h) > e} 

A(c) is a set of error regions, one per concept h £ C over all concepts. The 
error regions are with respect to the target concept. The set Ag(c) C A(c) 
is the set of all error regions whose weight exceeds e. Recall that weight is 
defined as the probability that a point sampled according to D will hit the 
region. 

It will be important for later to evaluate the VC dimension of A(c). Unlike 
C, we are not looking for the VC dimension of a class of function but the 
VC dimension of a set of regions in space. Recall the definition of Ilc{S) 
from the previous lecture: there were two equivalent definitions one based 
on a set of vectors each representing a labeling of the instances of S induced 
by some concept. The second, yet equivalent, definition is based on a set 
of subsets of S each induced by some concept (where the concept divides 
the sample points of S into positive and negative labeled points). So far it 
was convenient to work with the first definition, but for evaluating the VC 
dimension of A(c) it will be useful to consider the second definition: 

n^ic){S) = {rnS : re A(c)}, 

that is, the collection of subsets of S induced by intersections with regions 
of A(c). An intersection between S and a region r is defined as the subset of 
points from S that fall into r. We can easily show that the VC dimensions 
of C and A(c) are equal: 

Lemma 1 

VCdim{C) = VCdim{A{c)). 

Proof: we have that the elements of Ilc{S) and n^(c)(S') are susbsets of 

S, thus we need to show that for every S the cardinality of both sets is 
equal |n(7(S')| = |n^(-,,-)(S')|. To do that it is sufficient to show that for every 
element s G 11^(5') there is a unique corresponding element in n^(£.)(S'). Let 
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cn 5 be the subset of S induced by the target concept c. The set s (a subset 

of S) is realized by some concept h (those points in S which were labeled 
positive by h). Therefore, the set s H (c H 5) is the subset of S containing 
the points that hit the region hAc which is an element of n^(-g-)(5'). Since 
this is a one-to-one mapping we have that |nc(5')| = |n^(-(,)(5)|. [] 

Definition 9 (e-net) For every e > 0, a sample set S is an e-net for A(c) 
if every region in A.e{c) is hit by at least one point of S: 

VreAe(c), Snvy^H}. 

In other words, if S hits all the error regions in A(c) whose weight exceeds 
e, then S is an e-net. Consider as an example the concept class of intervals 
on the line [0, 1]. A concept is defined by an interval [ai,a2] such that all 
points inside the interval are positive and all those outside are negative. 
Given c G C is the target concept and /i € C is some concept, then the 
error region hAc is the union of two intervals: /i consists of all points x £ h 
which are not in c, and I2 the interval of all points x G c but which are not 
in h. Assume that the distribution D is uniform (just for the sake of this 
example) then, proh{x G /) = |/| which is the length of the interval I. As a 
result, err{h) > e if either > e/2 or II2I > e/2. The sample set 

kf 

S = {x=- : fc = 0,l,...,2/e} 

contains sample points from to 1 with increments of e/2. Therefore, every 
interval larger than e must be hit by at least one point from S and by 
definition S is an e-net. 

It is important to note that if S forms an e-net then we are guaranteed 
that err{h) < e. Let h £ C he the consistent hypothesis with S (returned 
by the learning algorithm L). Becuase h is consistent, hAc G A(c) has 
not been hit by S (recall that hAc is the error region with respect to the 
target concept c, thus if h is consistent then it agrees with c over S and 
therefore S does not hit hAc). Since S forms an e-net for A(c) we must 
have hAc A^{c) (recall that by definition S hits all error regions with 
weight larger than e). As a result, the error region hAc must have a weight 
smaller than e which means that err{h) < e. 

The conclusion is that if we can bound the probability that a random sam- 
ple S does not form an e-net for A(c), then we have bounded the probability 
that a concept h consistent with S has err{h) > e. This is the goal of the 
proof of the double-sampling theorem which we are about to prove below: 
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Proof (following Kearns & Vazirani [3j pp. 59 61): Let Si be a ran- 
dom sample of size m (sampled i.i.d. according to the unknown distribution 
D) and let A be the event that Si does not form an e-net for A(c). From 
the preceding discussion our goal is to upper bound the probability for A to 
occur, i.e., prob{A) < 6. 

If A occurs, i.e., 5*1 is not an e-net, then by definition there must be some 
region r £ Ae(c) which is not hit by 5i, that is S*! H r = 0. Note that 
r = /iA(c) for some concept h which is consistent with Si. At this point the 
space of possibilities is infinite, because the probability that we fail to hit 
/iA(c) in m random examples is at most (1 — e)™'. Thus the probability that 
we fail to hit some hAc G ^^(c) is bounded from above by |A(c)|(l — e)*" 
— which does not help us due to the fact that |A(c)| is infinite. The idea 
of the proof is to turn this into a finite space by using another sample, as 
follows. 

Let 52 be another random sample of size m. We will select m (for both 
Si and S2) to guarantee a high probability that 5*2 will hit r many times. 
In fact we wish that ^2 will hit r at least ^ with probability of at least 0.5: 

CTTl CTTl 

prob{\S2 n r| > — ) = 1 — prob{\S2 Pi r| < ^-)- 

We will use the Chernoff bound (lower tail) to obtain a bound on the right- 
hand side term. Recall that if we have m Bernoulli trials (coin tosses) 
Zi, Zm with expectation E{Zi) = p and we consider the random variable 
Z = Zi + ... + Zm with expectation E{Z) = ^ (note that /x = pm) then for 
all < "0 < 1 we have: 

prob{Z < (1 - -0)/^) < e 2 . 

Considering the sampling of m examples that form S2 as Bernoulli trials, 
we have that /i > em (since the probability that an example will hit r is at 
least e) and '0 = 0.5. We obtain therefore: 

prob{\S2 n r| < (1 — -)em) < = - 

which happens when m = |ln2 = O(^). To summarize what we have 
obtained so far, we have calculated the probability that 52 will hit r many 
times given that r was fixed using the previous sampling, i.e., given that Si 
does not form an e-net. To formalize this, let B denote the combined event 
that Si does not form an e-event and 52 hits r at least em/2 times. Then, 
we have shown that for m = 0(l/e) we have: 

prob{B/A) > ^. 
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Prom this we can calculate prob{B) : 

prob{B) = prob{B/A)prob{A) > -prob{A), 

which means that our original goal of bounding prob{A) is equivalent to 
finding a bound prob{B) < S/2 because prob{A) < 2 • prob{B) < S. The 
crucial point with the new goal is that to analyze the probability of the 
event B, we need only to consider a finite number of possibilities, namely to 
consider the regions of 

nA,(c)(5iU52) = {rn{SiUS2} : rGA,(c)}. 

This is because the occurrence of the event B is equivalent to saying that 
there is some r G nA^(c)(S'i U 5*2) such that |r| > em/2 (i.e., the region r 
is hit at least em/2 times) and 5i Pi r = 0. This is because nA^(c-)(5i U 5*2) 
contains all the subsets of Si U ^2 realized as intersections over all regions 
in A^{c). Thus even though we have an infinite number of regions we still 
have a finite number of subsets. We wish therefore to analyze the following 
probability: 

prob (r G IIa,{c) (Si U S2) : \r\ > em/2 and Si n r = 0) . 

Let S = S1US2 a random sample of 2m (note that since the sampling is i.i.d. 
it is equivalent to sampling 6*1 and 52 separately) and r satisfying |r| > em/2 
being fixed. Consider some random partitioning of S into Si and 52 and 
consider then the problem of estimating the probability that Si D r = 0. 
This problem is equivalent to the following combinatorial question: we have 
2m balls, each colored Red or Blue, with exaclty I > em/2 Red balls. We 
divide the 2m balls into groups of equal size Si and ^2 and we are interested 
in bounding the probabihty that all of the I balls fall in ^2 (that is, the 
probability that Si Dr = 9). This in turn is equivalent to first dividing the 
2m uncolored balls into and S2 groups and then randomly choose I of 
the balls to be colored Red and analyze the probability that all of the Red 
balls fall into ^2. This probability is exactly 

/m\ l-l . l-l -I 1 

Ail = TT <Y\- = - = 2-^"*/2 

prrA ii 2TO_i - ii 2 2' 
W / i=0 1=0 

This probability was evaluated for a fixed S and r. Thus, the probability that 
this occurs for some r € 11^^(^-1(5') satisfying |r| > em/2 (which is prob{B)) 
can be calculated by summing over all possible fixed r and applying the 
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union bound prob{J2i ^i) ^ J2iP^^^i^i)- 

prob{B) < |nA^(,)(5)|2-^™/2<|n^(,)(5)|2-^™/2 

from which it follows that: 

^Ai 1 1 
m = C - log - H — log - 
\e e e 

D 

Few comments are worthwhile at this point: 

(i) It is possible to show that the upper bound on the sample complexity 
m is tight by showing that the lower bound on m is ^l^d/e) (see [[3], 
pp. 62]). 

(ii) The treatment above holds also for the unrealizable case (target con- 
cept c C) with slight modifications to the bound. In this context, 
the learning algorithm L must simply minimize the sample (empiri- 
cal) error err{h) defined: 

err{h) = : h{y.i) / yi}\ Xj G S. 

m 

The generalization of the double-sampling theorem (Derroye'82) states 
that the empirical errors converge uniformly to the true errors: 



proh ( max \ errih) — err{h)\ > e ) < 4e^'^''~^^'' ^ 

\hGC ) 



2 

em 



from which it follows that 

'm = 0\ ^ log ^ + ^ log - 



Taken together, we have arrived to a fairly remarkable result. Despite the 
fact that the distribution D from which the training sample S is drawn from 
is unknown (but is known to be fixed) , the learner simply needs to minimize 
the empirical error. If the sample size m is large enough the learner is guar- 
anteed to have minimized the true errors for some accuracy and confidence 
parameters which define the sample size complexity. Equivalently, 

\Opt{C) - errih)\ ^m^oo 0. 



Not only is the convergence is independent of D but also the rate of con- 
vergence is independent (namely, it does not matter where the optimal h* 
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is located). The latter is very important because without it one could ar- 
bitrarily slow down the convergence rate by maliciously choosing D. The 
beauty of the results above is that D does not have an effect at all — one 
simply needs to choose the sample size to be large enough for the accuracy, 
confidence and VC dimension of the concept class to be learned over. 

9.2 Optimality of SVM Revisited 

In Lecture |4] we discussed the large margin principle for finding an optimal 
separating hyperplane. It is natural to ask how does the PAC theory pre- 
sented so far explains why a maximal margin hyperplane is optimal with 
regard to the formal sense of learning (i.e. to generalization from empirical 
errors to true errors)? We saw in the previous section that the sample com- 
plexity m(e, 6, d) depends also on the VC dimension of the concept class — 
which is n -|- 1 for hyperplanes in i?". Thus, another natural question that 
may certainly arise is what is the gain in employing the "kernel trick" ? For a 
fixed m, mapping the input instance space X of dimension n to some higher 
(exponentially higher) feature space might simply mean that we are compro- 
mising the accuracy and confidence of the learner (since the VC dimension 
is equal to the instance space dimension plus 1). 

Given a fixed sample size m, the best the learner can do is to minimize the 
empirical error and at the same time to try to minimize the VC dimension d 
of the concept class. The smaller d is, for a fixed m, the higher the accuracy 
and confidence of the learning algorithm. Likewise, the smaller d is, for a 
fixed accuracy and confidence values, the smaller sample size is required. 

There are two possible ways to decrease d. First is to decrease the dimen- 
sion n of the instance space X. This amounts to "feature selection" , namely 
find a subset of coordinates that are the most "relevant" to the learning 
task r perform a dimensionality reduction via PCA, for example. A second 
approach is to maximize the margin. Let the margin associated with the 
separating hyperplane h (i.e. consistent with the sample S) be 7. Let the 
input vectors x S X have a bounded norm, |x| < R. It can be shown that 
the VC dimension of the concept class of hyperplanes with margin 7 is: 

= min I , n J -|- 1 . 

Thus, if the margin is very small then the VC dimension remains n+\. As 
the margin gets larger, there comes a point where -R^/7^ < n and as a result 
the VC dimension decreases. Moreover, mapping the instance space X to 
some higher dimension feature space will not change the VC dimension as 
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long as the margin remains the same. It is expected that the margin will not 
scale down or will not scale down as rapidly as the scaling up of dimension 
from image space to feature space. 

To conclude, maximizing the margin (while minimizing the empirical er- 
ror) is advantageous as it decreases the VC dimension of the concept class 
and causes the accuracy and confidence values of the learner to be largely 
immune to dimension scaling up while employing the kernel trick. 
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AO.l Variance, Covariance, etc. 

Let X, Y be two random variables and let /(x, y) be some function on x G 
X,y E Y, and let p{x, y) be the probability of the event x and y occurring 
together. The expectation E[f{x,y)] is defined: 

xeXyeY 

. The mean, variance and covariance are defined: 

= E[x] = J2J2 ^p^^' 

X y 

% = E[Y] = ^^yp{x,y) 

X y 

al = Var[X] = E[{x-ii,f] = ^Y.(x-ii,fpix,y) 

X y 

al = Var[Y] = E[{y-^lyf] = Y,Y.^y-y.yfp{x,y) 

X y 

a:cy = Cov{XY) = E[{x- Hx)iy- IJ'y)] = ^^ix- Hx)iy- IJ'y)pix,y) 

X y 

In vector-matrix notation, let x represent the n random variables oiXi,...,Xr 
i.e., X = {xi, Xn)'^ is an instance vector and p(x) is the probability of the 
instance occurrence. Then the mean is a vector fj, and the covariance matrix 
E are defined: 

H = X xp(x) 

X€{Xi,...,X„} 

E = X(x-/x)(x-^)Tp(x) 

X 

Note that the covariance matrix E is the linear superposition of rank-1 
matrices (x — /n) (x — //)^ with coefficients p(x) . The diagonal of E containes 
the variances of the variables xi,...,Xn- For a uniform distribution and 

a sample data S consisting of rn points, let ^4 = [xi — fj,,...,Xm — //] be 
the matrix whose columns consist of the points centered around the mean: 
l^~m ^i^'^- '^^^ (sample) covariance matrix \s, E = ^AAJ . 
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AO. 2 Derivatives of Matrix Operations: Scalsir Functions of a 

Vector 

The two most important examples of a scalar function of a vector x are the 
linear form a^x and the quadratic form x^^lx for some square matrix A. 

d(x^^x) = (dx)^Ax + xTA(dx) 
= x^(A + y4^)dx 

where the derivative d{'x^ Ax) using the rule of products d{f ■ g) = (df) ■ g + 
/ ■ [dg) where g = ^x and / = x^ and noting that d^Ax) = Adx. Thus, 
-^{aJx) = aJ and J^(x'^^x)) = x'^(^ + A~^). If A is symmetric then 
J^(xTAx)) = {2Axy. 

AO. 3 Primer on Constrained Optimization 

AO. 3.1 Equality Constraints and Lagrange Multipliers 

Consider first the general optimization with equality constraints which gives 
rise to the notion of Lagrange multipliers. 

mm /(x) (0.1) 

subject to 
h(x) = 

where / : — > R and h : i?" R^ where h is a vector function [hi, hk) 
each from i?" to R. We want to derive a necessary and sufficient constraint 
for a point Xq to be a local minimum subject to the k equality constraints 
h(x) = 0. Assume that Xo is a regular point, meaning that the gradient 
vectors Vhj^x) are linearly independent. Note that Vh(xo) is a A; x n matrix 
and the null space of this matrix: 

null{Vh{xo)) = {y : Vh(xo)y = 0} 

defines the tangent plane at the point Xg. We have the following fundamental 
theorem: 

V/(x„) ± mi//(Vh(xo)) 

in other words, all vectors y spanning the tangent plane at the point Xq are 
also perpendicular to the gradient of / at Xq. 
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The sketch of the proof is as fohows. Let x(t), —a < t < a, he a smooth 
curve on the surface h(x) = 0, i.e., h(x(t)) = 0. Let Xq = x(0) and y = 
^x(O) the tangent to the curve at Xq. From the definition of tangency, the 
vector y hves in nu//(Vh(xo)), i.e., y • V/ij(x(0)) = 0, j = l,...,k. Since 
Xq = x(0) is a local extremum of /(x), then 

= |/W,))U = 5:|i^U = V/W.y. 

As a corollary of this basic theorem, the gradient vector V/(xo) G span{Vhi(xo), V/ifc(xo)}, 



I.e., 



V/(xo) + ^AiV/ii(xo) = 0, 



i=l 

where the coefficients Aj are called Lagrange Multipliers and the expression: 

/(x)+^AA(x) 



is called the Lagrangian of the optimization problem (0.1). 



A 0.3. 2 Inequality Constraints and KKT conditions 

Consider next the general constrained optimization with inequality con- 
straints (called "non-linear programming"): 

min /(x) (0.2) 

subject to 
h(x) = 
g(x) < 

where g : i?" — > W^. We will assume that the optimal solution Xq is a 
regular point which has the following meaning: Let J be the set of indices 
j such that gj{xo) = 0, then Xq is a regular point if the gradient vectors 
V/ij(xo), V(7j(xo), i = l,...,k and j £ J are linearly independent. A basic 
result (we will not prove here) is the Karush-Kuhn- Tucker (KKT) theorem: 
Let Xo be a local minimum of the problem and suppose Xq is a regular 
point. Then, there exist Ai, A^ and fii > 0, > such that: 




Fig. AO.l. Geometric interpreatation of Duality (see text). 



V/(xo)+ J^AiV/i,(xo) + ^^jV5j(xo) = 0, (0.3) 
i=i j=i 

s 

Y,f^j9jM = 0. (0.4) 

i=i 

Note that the condition ^ fj,jgj{xo) = is equivalent to the condition that 
fj,jgj{xo) = (since /i > and g(xo) < thus there sum cannot vanish 
unless each term vanishes) which in turn implies: = when gj{xo) < 0. 
The expression 

k s 

L(x, A, //) = /(x) + ^ Ai/ii(x) + ^ /ij5j(x) 
1=1 j=i 



is the Lagrangian of the problem (0.2 ) and the associated condition fijgj{xo) = 
is called the KKT condition. 

The remaining concepts we need are the "duality" and the "Lagrangian 
Dual" problem. 
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The optimization problem (0.2) is called the "Primal" problem. The La- 
grangian Dual problem is defined as: 

max 6l(A,^) (0.5) 
subject to 

IJ,>0 (0.6) 

where 

(9(A, /i) = niin{/(x) + ^ \ihi{^) + ^ ^^^^(x)}. 

i j 

Note that 6(X, /x) may assume the value — oo for some values of A, fi (thus 
to be rigorous we should have replaced "min" with "inf"). The first basic 
result is the weak duality theorem: 

Let X be a feasible solution to the primal (i.e., h(x) = 0, g(x) <0) and let 
(A, ^) be a feasible solution to the dual problem (i.e., fi > 0), then /(x) > 

The proof is immediate: 



e{X, n) = nfin{/(y) + ^ X^hi{y) + ^ /Xj5i(y)} 

* j 

i j 

< /(x) 

where the latter inequality follows from h(x) = and fJ-jdji^) < 
because jU > and g(x) < 0. As a corollary of this theorem we have: 

min{/(x) : h(x) = 0,g(x) < 0} > max{6'(A,^) : ;U > 0}. (0.7) 

X A,/i 

The next basic result is the strong duality theorem which specifies the con- 
ditions for when the inequality in (0.7) becomes equality: 

Let /(),g() be convex functions and let hO be affine, i.e., h(x) = Ax — b 
where A is a k x n matrix, then 

min{/(x) : h(x) = 0,g(x) < 0} = max{e{X, fi) : /i > 0}. 

X A,/i 

The strong duality theorem allows one to solve for the primal problem by 
first dualizing it and solving for the dual problem instead (we will see exactly 
how to do it when we return to solving the primal problem (4.3)). When 
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Fig. AO. 2. An example of duality gap arising from non-convexity (see text). 

the (convexity) conditions above do not hold we obtain 

min{/(x) : h(x) = 0, g(x) < 0} > max{6'(A, /i) : /i > 0} 

which means that the optimal solution to the dual problem provides only a 
lower bound to the primal problem — this situation is called a duality gap. 
Taken together, the "duality theorem" summarizes the discussion so far: 

Theorem 12 (Duality Theorem) In order for x.* to be an optimal Primal 
solution and (A*,/i*) to be an optimal Dual solution, it is necessary and 
sufficient that: 

(i) X* is Primal feasible, 

(ii) /i* > and fi*j = for all 5j(x*) < 0, 

(iii) X* G argmin^L{-K, \* , n*) . 

We will end this section with a geometric interpretation of duality. 



AO. 3. 4 Geometric Interpretation of Duality 

For clarity we will consider a primal problem with a single inequality con- 
straint: min{/(x) : ^(x) < 0} where g : i?" R. 

Consider the set G = {{y, z) : y = fl'(x), z = /(x)} in the {y, z) plane. The 



set G is the image of BP' under the (5, /) map (see Fig. AO.l ). The primal 
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problem is to find a point in G that has a y < with the smahest z value 
— this is the point (y*, z*) in the figure. 

In this case 9{n) = minx{/(x) + /i^(x)} which is equivalent to minimize 
z + ny over points in G. The equation z + = a represents a straight line 
with slope — /i and intercept (on z axis) a. For a given value /i, to minimize 
z + fiy over G we need to move the line z + fiy = a parallel to itself as far 
down as possible while it remains in contact with G — in other words G 
is above the line and touches it. Then, the intercept with the z axis gives 
9{fj.). The dual problem is therefore equivalent to finding the slope of the 
supporting hyperplane such that its intercept on the z axis is maximal. 

Consider the non-convex region G in Fig. AO. 2| which illustrates a duality 
gap condition. The optimal primal is the point (y*, z*) which is higher than 
the greatest intercept on the z axis achieved by a line that supports G from 
below. This is an example of a duality gap caused by the non-convexity of 
the functions f{),g{) (thereby making the set G non-convex). 
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